I'm reading the Scrapy tutorial on its official page: https://doc.scrapy.org/en/latest/intro/tutorial.html
Here is the code that confused me:
    import scrapy

    class AuthorSpider(scrapy.Spider):
        name = 'author'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)'):
                yield response.follow(href, self.parse_author)

            # follow pagination links
            for href in response.css('li.next a::attr(href)'):
                yield response.follow(href, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }
The key point is the following function, defined inside parse_author(self, response):

    def extract_with_css(query):
        return response.css(query).extract_first().strip()

As the tutorial says, "The parse_author callback defines a helper function to extract and clean up the data from a CSS query." Can anyone help me understand this? When will it be called?
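In short, extract_with_css is a closure: it is defined anew each time Scrapy invokes parse_author with a downloaded author page, it captures the enclosing response argument, and it is called immediately, once per field, while the yielded dict is being built. A minimal plain-Python sketch of the same pattern (a stand-in dict plays the role of the Response, so it runs without Scrapy):

```python
# Plain-Python sketch of the tutorial's pattern: a helper defined inside
# a callback closes over the callback's "response" argument.
def parse_author(response):
    # "response" is a plain dict standing in for a Scrapy Response here.
    def extract_with_css(query):
        # Captures "response" from the enclosing call; it only runs
        # while parse_author itself is running.
        return response[query].strip()

    # The helper is called once per field, right here, while the
    # returned dict is built.
    return {
        'name': extract_with_css('name'),
        'birthdate': extract_with_css('birthdate'),
    }

fake_page = {'name': '  Albert Einstein\n', 'birthdate': ' March 14, 1879 '}
print(parse_author(fake_page))
# → {'name': 'Albert Einstein', 'birthdate': 'March 14, 1879'}
```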
I have created the spider you can see below. I can get links from the homepage, but when I want to use them in the function, Scrapy doesn't follow the links. I don't get any HTTP or server error from the source.
    class GamerSpider(scrapy.Spider):
        name = 'gamer'
        allowed_domains = ['eurogamer.net']
        start_urls = ['http://www.eurogamer.net/archive/ps4']

        def parse(self, response):
            for link in response.xpath("//h2"):
                link = link.xpath(".//a/@href").get()
                content = response.xpath("//div[@class='details']/p/text()").get()
                yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})

            next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse)

        def parse_game(self, response):
            url = response.request.meta['url']
            # some things to get
            rows = response.xpath("//main")
            for row in rows:
                # some things to get
                yield {
                    'url': url,
                    # some things to get
                }
Any help?
So I'm trying to write a spider to continue clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. My question is why isn't it "following" the links that each next button leads to?
    class MyprojectSpider(scrapy.Spider):
        name = 'redditbot'
        allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
        start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
            if next_page:
                yield Request(next_page[1], self.parse)
                print(next_page[1])
To go to the next page, instead of just printing the link you need to yield a scrapy.Request object, as in the following code:
    import scrapy

    class MyprojectSpider(scrapy.Spider):
        name = 'myproject'
        allowed_domains = ['reddit.com']
        start_urls = ['https://www.reddit.com/r/nfl/']

        def parse(self, response):
            posts = response.xpath('//div[@class="top-matter"]')
            for post in posts:
                # Get your data here
                title = post.xpath('p[@class="title"]/a/text()').extract()
                print(title)
            # Go to next page
            next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: the previous code was wrong; it needed to use the absolute URL, and some of the XPaths were wrong too. This new version should work.
Hope it helps!
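The key fix above is response.urljoin, which resolves a (possibly relative) href against the page's own URL, following the same rules as the standard library's urllib.parse.urljoin. A quick standalone illustration:

```python
from urllib.parse import urljoin

base = 'https://www.reddit.com/r/nfl/'

# A relative pagination href resolves against the page URL...
print(urljoin(base, '?count=25&after=t3_7ax8lb'))
# → https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb

# ...while an href that is already absolute is left untouched.
print(urljoin(base, 'https://www.reddit.com/r/nfl/?count=50'))
# → https://www.reddit.com/r/nfl/?count=50
```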
I am a Python novice and am trying to write a script to extract the data from this page. Using scrapy, I wrote the following code:
    import scrapy

    class dairySpider(scrapy.Spider):
        name = "dairy_price"

        def start_requests(self):
            urls = [
                'http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for rows in response.xpath("//tr"):
                yield {
                    'text': rows.xpath(".//td/text()").extract().strip('. \n'),
                }
However, this didn't scrape anything. Do you have any ideas?
Thanks
The table on the page http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i is being dynamically added to the DOM by a request to http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0.
You should be scraping the second link instead of the first, since scrapy.Request only returns the HTML source, not content added by JavaScript.
UPDATE
Here is the working code for extracting the table data:

    import scrapy

    class dairySpider(scrapy.Spider):
        name = "dairy_price"

        def start_requests(self):
            urls = [
                "http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for row in response.css(".bcQuoteTable tbody tr"):
                print(row.xpath("td//text()").extract())
Make sure you edit your settings.py file and change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False
I am building a web scraper that downloads csv files from a website. I have to login to multiple user accounts in order to download all the files. I also have to navigate through several hrefs to reach these files for each user account. I've decided to use Scrapy spiders in order to complete this task. Here's the code I have so far:
I store the username and password info in a dictionary
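The code below calls login_info.items(), so login_info is assumed to have usernames as keys and passwords as values (the names here are hypothetical; real credentials would come from a config file or environment variables, not the source):

```python
# Hypothetical shape for login_info: usernames as keys, passwords as values.
login_info = {
    'user_one': 'password_one',
    'user_two': 'password_two',
}

# .items() then yields one (username, password) pair per account.
for uname, upass in login_info.items():
    print(uname, upass)
```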
    def start_requests(self):
        yield scrapy.Request(url="https://external.lacare.org/provportal/", callback=self.login)

    def login(self, response):
        for uname, upass in login_info.items():
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'username': uname,
                          'password': upass,
                          },
                dont_filter=True,
                callback=self.after_login
            )
I then navigate through the web pages by finding all href links in each response.
    def after_login(self, response):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            if 'listReports' in link:
                url_join = response.urljoin(link)
                return scrapy.Request(
                    url=url_join,
                    dont_filter=True,
                    callback=self.reports
                )
        return

    def reports(self, response):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.select_year
            )
        return
I then crawl through each href on the page and check the response to see if I can keep going. This portion of the code seems excessive to me, but I am not sure how else to approach it.
    def select_year(self, response):
        if '>2017' in str(response.body):
            hxs = scrapy.Selector(response)
            all_links = hxs.xpath('*//a/@href').extract()
            for link in all_links:
                url_join = response.urljoin(link)
                yield scrapy.Request(
                    url=url_join,
                    dont_filter=True,
                    callback=self.select_elist
                )
        return

    def select_elist(self, response):
        if '>Elists' in str(response.body):
            hxs = scrapy.Selector(response)
            all_links = hxs.xpath('*//a/@href').extract()
            for link in all_links:
                url_join = response.urljoin(link)
                yield scrapy.Request(
                    url=url_join,
                    dont_filter=True,
                    callback=self.select_company
                )
Everything works fine, but as I said, it does seem excessive to crawl through every href on the page. I wrote a script for this website in Selenium and was able to select the correct hrefs using the select_by_partial_link_text() method. I've searched for something comparable in Scrapy, but it seems like Scrapy navigation is based strictly on XPath and CSS selectors.
Is this how Scrapy is meant to be used in this scenario? Is there anything I can do to make the scraping process less redundant?
This is my first working scrapy spider, so go easy on me!
If you need to extract only links with a certain substring in the link text, you can use LinkExtractor with the following XPath:

    LinkExtractor(restrict_xpaths='//a[contains(text(), "substring to find")]').extract_links(response)

LinkExtractor is the proper way to extract and process links in Scrapy.
Docs: https://doc.scrapy.org/en/latest/topics/link-extractors.html
I followed the documentation, but I am still not able to crawl multiple pages.
My code is like this:
    def parse(self, response):
        for thing in response.xpath('//article'):
            item = MyItem()
            request = scrapy.Request(link,
                                     callback=self.parse_detail)
            request.meta['item'] = item
            yield request

    def parse_detail(self, response):
        print("here\n")
        item = response.meta['item']
        item['test'] = "test"
        yield item
Running this code does not call the parse_detail function and does not crawl any data. Any idea? Thanks!
I find that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.