I have created the spider you can see below. I can get the links from the homepage, but when I try to use them in a callback, Scrapy doesn't follow them. I don't get any HTTP or server errors from the source.
import scrapy

class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game,
                                  meta={'url': link, 'content': content})
        next_page = 'http://www.eurogamer.net' + response.xpath(
            "//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url
                # some things to get
            }
Any help?
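One way to narrow a problem like this down (a sketch, not the author's accepted fix) is to guard against a missing href and check whether Scrapy is silently dropping the requests:

    # Sketch only, dropped into the spider above: skip empty hrefs and watch
    # the crawl log for dropped requests. Run with `scrapy crawl gamer -L DEBUG`
    # and look for "Filtered offsite request" or "Filtered duplicate request" lines.
    def parse(self, response):
        for h2 in response.xpath("//h2"):
            link = h2.xpath(".//a/@href").get()
            if not link:
                continue
            yield response.follow(link, callback=self.parse_game, meta={'url': link})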
I am still trying to use Scrapy to collect data from pages on Weibo which need to be logged in to access.
I now understand that I need to use Scrapy FormRequests to get the login cookie. I have updated my Spider to try to make it do this, but it still isn't working.
Can anybody tell me what I am doing wrong?
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'WB'

    def start_requests(self):
        return [
            scrapy.Request("https://www.weibo.com/u/2247704362/home?wvr=5&lf=reg",
                           callback=self.parse_item)
        ]

    def parse_item(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata={'user': 'user', 'pass': 'pass'},
                                                callback=self.parse)

    def parse(self, response):
        print(response.body)
When I run this spider, Scrapy redirects from the URL in start_requests and then returns the following error:
ValueError: No <form> element found in <200 https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fweibo.com%2Fu%2F2247704362%2Fhome%3Fwvr%3D5%26lf%3Dreg&domain=.weibo.com&ua=php-sso_sdk_client-0.6.28&_rand=1585243156.3952>
Does that mean I need to get the spider to look for something other than form data on the original page? How do I tell it to look for the cookie?
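As an aside on the cookie part of the question: if the session is set through a cookie rather than a form, one option is to hand a known cookie to the request directly, since scrapy.Request takes a cookies argument. A minimal sketch, where the cookie name and value are placeholders rather than Weibo's real scheme:

    # Sketch: supplying a known session cookie directly.
    # "SESSION" and its value are placeholders, not Weibo's actual cookie.
    def start_requests(self):
        yield scrapy.Request(
            "https://www.weibo.com/u/2247704362/home?wvr=5&lf=reg",
            cookies={"SESSION": "value-copied-from-a-logged-in-browser"},
            callback=self.parse,
        )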
I have also tried a spider like the one below, based on this post.
import scrapy
from scrapy import Request

class LoginSpider(scrapy.Spider):
    name = 'WB'
    login_url = "https://www.weibo.com/overseas"
    test_url = 'https://www.weibo.com/u/2247704362/'

    def start_requests(self):
        yield scrapy.Request(url=self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        return scrapy.FormRequest.from_response(response, formid="W_login_form",
                                                formdata={"loginname": "XXXXX", "password": "XXXXX"},
                                                callback=self.start_crawl)

    def start_crawl(self, response):
        yield Request(self.test_url, callback=self.parse_item)

    def parse_item(self, response):
        print("Test URL " + response.url)
But it still doesn't work, giving the error:
ValueError: No <form> element found in <200 https://www.weibo.com/overseas>
I would really appreciate any help anybody can offer, as this is somewhat beyond my range of knowledge.
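For what it's worth, that ValueError is raised by FormRequest.from_response when it cannot find a <form> element in the downloaded page; Weibo builds its login form with JavaScript, which Scrapy does not execute. A hedged alternative is to POST the credentials to the login endpoint yourself; the endpoint URL and field names below are assumptions and should be taken from the browser's network tab:

    # Sketch: posting the login fields directly instead of using from_response.
    # The endpoint and field names are assumptions; inspect the real login
    # request in the browser's developer tools for the correct ones.
    def start_requests(self):
        yield scrapy.FormRequest(
            url="https://passport.weibo.com/sso/login",  # hypothetical endpoint
            formdata={"username": "XXXXX", "password": "XXXXX"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy's cookie middleware keeps whatever cookies the login set,
        # so this next request goes out with them attached.
        yield scrapy.Request(self.test_url, callback=self.parse_item)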
I am trying to follow a list of links and scrape data from each link with a simple Scrapy spider, but I am having trouble.
When I recreate the script in the Scrapy shell, it sends the GET request for the new URL, but when I run the crawl I do not get any data back from the link. The only data I get back is from the starting URL, which is scraped before going to the link.
How do I scrape data from the link?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "players"
    start_urls = ['http://wiki.teamliquid.net/counterstrike/Portal:Teams']

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {
                'teamName': team.css('a::text').extract_first()
            }

        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_team_info)

    def parse_team_info(self, response):
        yield {
            'Test': response.css('span::text').extract_first()
        }
Instead of using
url = response.urljoin(url)
yield scrapy.Request(url, callback=self.parse_team_info)
use
yield response.follow(url, callback=self.parse_team_info)
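Dropped into the spider from the question, the loop might look like this (a sketch of the suggested change, not a tested crawl). response.follow resolves relative URLs itself and also accepts the attribute Selector directly:

        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            # no urljoin needed: response.follow handles relative URLs and Selectors
            yield response.follow(url, callback=self.parse_team_info)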
So I'm trying to write a spider that keeps clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page, but prints it only once. My question is: why isn't it "following" the links that each next button leads to?
import scrapy
from scrapy import Request
from scrapy.selector import HtmlXPathSelector

class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
To go to the next page, instead of just printing the link you need to yield a scrapy.Request object, as in the following code:
import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)

        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: the previous code was wrong; it needed to use the absolute URL, and some of the XPaths were wrong as well. This new version should work.
Hope it helps!
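One more detail worth flagging in the original spider: allowed_domains held a full URL, which Scrapy's offsite middleware can never match against a request's host, so follow-up requests are likely dropped silently. It should list bare domains only, e.g.:

allowed_domains = ['reddit.com']  # domains only, never full URLs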
I'm reading the Scrapy tutorial from its official page: https://doc.scrapy.org/en/latest/intro/tutorial.html
Here is the code that confused me:
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
The key point is the following function, which is defined inside parse_author(self, response):
def extract_with_css(query):
    return response.css(query).extract_first().strip()
As the tutorial says, "The parse_author callback defines a helper function to extract and cleanup the data from a CSS query." Can anyone help me understand this? When will it be called?
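In short, extract_with_css is an ordinary local function that closes over the response argument of parse_author, and it runs immediately, once per field, while the yielded dict is being built; nothing is deferred. A stripped-down, plain-Python analogue of the same pattern:

# Analogue only: the inner helper closes over 'data' from the enclosing
# call and is called right where the dict is constructed.
def parse_author(data):
    def extract(key):
        return data[key].strip()

    return {
        'name': extract('name'),            # called here
        'birthdate': extract('birthdate'),  # and here
    }

print(parse_author({'name': ' Jane Austen ', 'birthdate': ' 1775-12-16 '}))
# {'name': 'Jane Austen', 'birthdate': '1775-12-16'}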
I followed the documentation, but I am still not able to crawl multiple pages.
My code is like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code does not call the parse_detail function and does not crawl any data. Any ideas? Thanks!
I find that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.
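Without seeing the allowed_domains value it is hard to be certain, but the usual cause of this symptom is Scrapy's offsite middleware silently dropping the detail requests because the entries do not match the link's host (a 'www.' subdomain is fine, but a full URL or a wrong domain is not). Two things worth trying, sketched with placeholder values:

# Sketch with placeholder values: allowed_domains should hold bare domains
# (subdomains such as www.example.com are then allowed automatically).
allowed_domains = ['example.com']

# To confirm the offsite filter is the culprit, bypass it for the detail
# request and see whether parse_detail starts being called.
request = scrapy.Request(link, callback=self.parse_detail, dont_filter=True)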