I am new to Python and Scrapy and am trying to work through a small example, however I am having some problems!
I am able to crawl only the first given URL; I am unable to crawl more than one page, or an entire website for that matter!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple...
My items.py
import scrapy
class WikiItem(scrapy.Item):
    title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem
class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item
When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:
title
Portal:Arts
Can anyone give me insight into why it is not following URLs and crawling deeper?
I have checked some related SO questions, but they have not helped to solve the issue.
scrapy.Spider is the simplest spider. Change the class name from CrawlSpider, since CrawlSpider is one of Scrapy's generic spiders.
One of the options below can be used:
eg: 1. class WikiSpider(scrapy.Spider)
or 2. class WikiSpider(CrawlSpider)
If you use the first option, you need to code the logic for following the links you want on that webpage yourself, as in the sketch below.
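For illustration, here is a rough, untested sketch of the first option; the restriction to links starting with /wiki/ is an assumption, adapt it to the links you actually need:

import scrapy
from wiki.items import WikiItem

class WikiSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    def parse(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item

        # manually follow the links you are interested in;
        # the '/wiki/' filter here is only an illustrative assumption
        for href in response.xpath('//a/@href').extract():
            if href.startswith('/wiki/'):
                yield scrapy.Request(response.urljoin(href), callback=self.parse)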
If you use the second option, you can do the following: after the start URLs, define the rules like this:
rules = (
    Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also, please rename the function currently defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method internally to implement its logic, so by overriding parse you break the crawl spider.
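Putting the pieces together, a rough, untested sketch of the CrawlSpider variant could look like this; the allow pattern is broadened to /wiki/ here as an assumption so that more article links are followed, and you should adapt it to your needs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wiki.items import WikiItem

class WikiSpider(CrawlSpider):
    name = "wikip"
    # allowed_domains takes domains only, no path such as /wiki/
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Portal:Arts"]

    rules = (
        # follow links under /wiki/ and pass each response to parse_item
        Rule(LinkExtractor(allow=(r'/wiki/',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item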
I am assigned to create a crawler using Python and Scrapy to get the reviews of a specific hotel. I have read quite a number of tutorials and guides, but my code still just keeps generating an empty CSV file.
Item.py
import scrapy
class AgodaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    StarRating = scrapy.Field()
    Title = scrapy.Field()
    Comments = scrapy.Field()
Agoda_reviews.py
import scrapy
class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    allowed_domains = ['agoda.com']
    start_urls = ['https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9']

    def parse(self, response):
        StarRating = response.xpath('//span[@class="Review-comment-leftScore"]/span/text()').extract()
        Title = response.xpath('//span[@class="Review-comment-bodyTitle"]/span/text()').extract()
        Comments = response.xpath('//span[@class="Review-comment-bodyText"]/span/text()').extract()

        count = 0
        for item in zip(StarRating, Title, Comments):
            # create a dictionary to store the scraped info
            scraped_data = {
                'StarRating': item[0],
                'Title': item[1],
                'Comments': item[2],
            }
            # yield or give the scraped info to scrapy
            yield scraped_data
Can anybody please kindly let me know where the problems are? I am totally clueless...
Your results are empty because scrapy is receiving a response that does not have a lot of content. You can see this by starting a scrapy shell from your terminal and sending a request to the page you are trying to crawl.
scrapy shell 'https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9'
Then you can view the response that scrapy received by running:
view(response)
That should open the response that was received and stored by scrapy in your browser. As you will see, there are no reviews to extract.
Also, as you are trying to extract some information from span-elements, you can run response.css('span').extract() and you will see that there are some span-elements in the response but none of them has a class that has anything to do with Reviews.
So to sum up, Agoda is sending you a rather empty response, and as a consequence scrapy is extracting empty lists. Possible reasons could be: Agoda has figured out that you are trying to crawl their website, for example based on your user agent, and is therefore hiding the content from you; or they are using JavaScript to generate the content.
To solve your problem, you should either use the Agoda API, familiarize yourself with user agent spoofing, or check out the Selenium package, which can help with JavaScript-heavy websites.
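For instance, user agent spoofing can be as simple as overriding the USER_AGENT setting for this one spider via custom_settings. The user agent string below is just an illustrative desktop-browser string, and whether Agoda serves the full page to it is an assumption:

import scrapy

class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    # Illustrative only: pretend to be a regular desktop browser.
    # It is an assumption that this is enough to get the full page from Agoda.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/80.0.3987.132 Safari/537.36'),
    }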
I am very new to Python and Scrapy, and I have written a crawler in PyCharm as follows:
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re
class TutsplusItem(scrapy.Item):
    title = scrapy.Field()
class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # We stored already crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if not link in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item
However, when I run the above code, nothing prints on the screen! What is wrong with my code?
Put the code in a text file, name it something like your_spider.py, and run the spider using the runspider command:
scrapy runspider your_spider.py
You would typically start scrapy using scrapy crawl, which will hook everything up for you and start the crawling.
It also looks like the code as posted is not properly indented (only one line is inside parse when they all should be).
To run a spider from within Pycharm you need to configure "Run/Debug configuration" properly. Running your_spider.py as a standalone script wouldn't result in anything.
As mentioned by @stranac, scrapy crawl is the way to go, scrapy being the executable and crawl an argument to it.
Configure Run/Debug
In the main menu go to :
Run > Run Configurations...
Find the appropriate scrapy binary within your virtualenv and set its absolute path as Script.
It should look something like this:
/home/username/.virtualenvs/your_virtualenv_name/bin/scrapy
In the script parameters, set up the arguments that the scrapy binary will execute. In your case, you want to start your spider. This is how it should look:
crawl your_spider_name e.g. crawl tutsplus
Make sure that the Python interpreter is the one where you set up Scrapy and the other packages needed for your project.
Make sure that the working directory is the directory containing settings.py, which is also generated by Scrapy.
From now on you should be able to Run and Debug your spiders from within Pycharm.
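As an alternative to the Run/Debug configuration, you can also add a small launcher script next to scrapy.cfg and run or debug that file directly in PyCharm. This is only a sketch, and the import path of the spider module is an assumption that depends on your project layout:

# run_spider.py -- place next to scrapy.cfg and run/debug it in PyCharm
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# adjust this import to wherever MySpider actually lives in your project
from tutsplus.spiders.tutsplus_spider import MySpider

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # the script blocks here until crawling is finished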
I am using scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category.
For example, on my blog the category "Environment setup" has 17 posts. In the scrapy code I can hard-code them as given below, but this is not a very practical approach:
start_urls=["https://edumine.wordpress.com/category/ide- configuration/environment-setup/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
I have read similar posts related to this question here on SO, like 1, 2, 3, 4, 5, 6, 7, but I can't seem to find the answer in any of them. As you can see, the only difference is the page count in the above URLs. How can I write a rule in scrapy that can read all the blog posts in a category? And another trivial question: how can I configure the spider to crawl my blog such that when I post a new blog entry, the crawler can immediately detect it and write it to a file?
This is what I have so far for the spider class
from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
class MySpider(CrawlSpider):
    name = "nextpage"  # give your spider a unique name because it will be used for crawling the webpages
    # allowed_domains restricts the spider's crawling
    allowed_domains = ["https://edumine.wordpress.com/"]
    # in start_urls you have to specify the urls to crawl from
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]
    '''
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]

    rules = [
        Rule(SgmlLinkExtractor(allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"), unique=False, follow=True))
    ]
    '''
    rules = Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'), follow=True, callback='parse_page')

    def parse_page(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//h1[@class='entry-title']")
        items = []
        with open("itemLog.csv", "w") as f:
            for title in titles:
                item = BlogscraperItem()
                item["post_title"] = title.xpath("//h1[@class='entry-title']//text()").extract()
                item["post_time"] = title.xpath("//time[@class='entry-date']//text()").extract()
                item["text"] = title.xpath("//p//text()").extract()
                item["link"] = title.select("a/@href").extract()
                items.append(item)
                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(item['post_title'], item['post_time'], item['text']))
            print "#### \tTotal number of posts= ", len(items), " in category####"
            f.close()
Any help or suggestions to solve it?
You have some things you can improve in your code and two problems you want to solve: reading posts, automatic crawling.
If you want to get the contents of a new blog post you have to re-run your spider. Otherwise you would have an endless loop. Naturally in this case you have to make sure that you do not scrape already scraped entries (database, read available files at spider start and so on). But you cannot have a spider which runs forever and waits for new entries. This is not the purpose.
Your approach of storing the posts in a file inside the spider is wrong. Why do you scrape a list of items and then do nothing with them? And why do you save the items in the parse_page function? This is what item pipelines are for: you should write one and do the exporting there. Also, the f.close() is not necessary, because the with statement already does this for you at the end.
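A minimal sketch of such a pipeline could look like the following; the class name and the CSV layout are assumptions, so adapt the field handling to your item:

# pipelines.py
import csv

class BlogCsvPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the spider starts
        self.file = open('itemLog.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['post_title', 'post_time', 'text', 'link'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write one row per scraped item and pass the item along
        self.writer.writerow([item.get('post_title'), item.get('post_time'),
                              item.get('text'), item.get('link')])
        return item

Then enable it in settings.py with ITEM_PIPELINES = {'BlogScraper.pipelines.BlogCsvPipeline': 300} and simply yield the items from parse_page instead of writing the file there.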
Your rules variable should throw an error because it is not iterable. I wonder if you even tested your code. And the Rule is too complex. You can simplify it to this:
rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page'),]
And it follows every URL which has /page in it.
If you start your scraper you will see that the results are filtered because of your allowed domains:
Filtered offsite request to 'edumine.wordpress.com': <GET https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/>
To solve this change your domain to:
allowed_domains = ["edumine.wordpress.com"]
If you want to crawl other WordPress sites as well, simply change it to
allowed_domains = ["wordpress.com"]
Hi all, I am trying to get all the results from the link given in the code, but my code is not returning all of them. The link says it contains 2132 results, but only 20 are returned:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart
class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically or use something like Selenium w/ PhantomJS to let the browser deal with it. I'd recommend the latter since it's more fail-proof than the manual approach of interpreting the JS yourself. See this question for more information, and Google around, there's plenty of information on this topic.
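As a rough, untested illustration of the second option, you could let Selenium drive a browser (PhantomJS here, as mentioned above), scroll a few times to trigger the lazy loading, and then feed the rendered HTML into a Scrapy selector. The number of scrolls and the sleep time are arbitrary assumptions:

import time

from scrapy.selector import Selector
from selenium import webdriver

driver = webdriver.PhantomJS()  # or webdriver.Firefox(), etc.
driver.get("http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All")

# scroll down a few times so the JavaScript loads more results
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# hand the rendered page over to a Scrapy selector and extract as usual
sel = Selector(text=driver.page_source)
titles = sel.xpath('//div[@class="pu-details lastUnit"]/div[1]/a/text()').extract()
driver.quit()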
I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, however it appears to only scrape the start_urls, and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem
class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"]  # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3'" and the other is <span class="list2">solution</span>.
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spiders.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has the functionality to handle rules built in.
Next, you'll need to rename your parse() method, since CrawlSpider uses parse() internally to work. (We'll assume parse_page() for the rest of this answer.)
To pick up all basic links, and have them crawled, your link extractor will need to be changed. By default, you shouldn't use regular expression syntax for domains you want to follow. The following will pick it up, and your DUPEFILTER will filter out links not on the site:
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs')", because it can't execute showLayer_tap() or win_open() in JavaScript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
import re

def process_onclick(value):
    # pull the target URL out of the onclick attribute value
    m = re.search("location.href=\"(.*?)\"", value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),