My issue is that when I added the redirect code from "Can't get Scrapy to parse and follow 301, 302 redirects" to my script, it solved the problem in the sense that the spider now runs without errors, but I no longer get any output to my CSV file. The problem is that in parse_links1, the if and else branches both end with a 'yield' statement, and this seems to prevent the scrapy.Request line from ever executing. This is fairly clear, since the previous iteration of this code, which only went down two levels of links, ran perfectly. But since the latest level has a redirect issue, I had to add that code in.
My code is like this:
class TurboSpider(scrapy.Spider):
    name = "fourtier"
    handle_httpstatus_list = [404]
    start_urls = [
        "https://ttlc.intuit.com/browse/cd-download-support"]

    # def parse gets first set of links to use
    def parse(self, response):
        links = response.selector.xpath('//ul[contains(@class, "list-unstyled")]//@href').extract()
        for link in links:
            yield scrapy.Request(link, self.parse_links, dont_filter=True)

    def parse_links(self, response):
        tier2_text = response.selector.xpath('//a[contains(@class, "dropdown-item-link")]//@href').extract()
        for link in tier2_text:
            schema = 'https://turbotax.intuit.com/'
            links_to_use = urlparse.urljoin(schema, link)
            yield scrapy.Request(links_to_use, self.parse_links1)

    def parse_links1(self, response):
        tier2A_text = response.selector.xpath('//a').extract()
        for t in tier2A_text:
            if response.status >= 300 and response.status < 400:
                # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
                location = to_native_str(response.headers['location'].decode('latin1'))
                request = response.request
                redirected_url = urljoin(request.url, location)
                if response.status in (301, 307) or request.method == 'HEAD':
                    redirected = request.replace(url=redirected_url)
                    yield redirected
                else:
                    redirected = request.replace(url=redirected_url, method='GET', body='')
                    redirected.headers.pop('Content-Type', None)
                    redirected.headers.pop('Content-Length', None)
                    yield redirected
            yield scrapy.Request((t, self.parse_links2))

    def parse_links2(self, response):
        divs = response.selector.xpath('//div')
        for p in divs.select('.//p'):
            yield {'text': p.extract()}
What is wrong with the way I've set up the 'yield' statements in the parse_links1 function, such that I now get no output? How do I combine several 'yield' statements?
See Debugging Spiders.
Some logging statements should allow you to determine where something unexpected is happening (execution not reaching a certain line, some variable containing unexpected data), which in turn should help you either understand what the issue is or write a more specific question that is easier to answer.
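For instance, a minimal sketch of what such logging could look like inside parse_links1 (the method and variable names come from the question; the redirect handling and the final request are abbreviated to comments):

def parse_links1(self, response):
    self.logger.info("parse_links1 reached: %s (status %s)", response.url, response.status)
    tier2A_text = response.selector.xpath('//a').extract()
    self.logger.info("parse_links1 found %d anchors", len(tier2A_text))
    for t in tier2A_text:
        if 300 <= response.status < 400:
            self.logger.info("redirect branch taken for %s", response.url)
            # ... existing redirect handling, unchanged ...
        # if this line never appears in the log, execution is not reaching
        # the scrapy.Request at the bottom of the loop
        self.logger.info("about to yield the follow-up request for %r", t)
        # ... yield the follow-up scrapy.Request as in the original code ...

Running the spider and searching the log for these messages shows how far each callback gets and what data the variables hold.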
Related
I have a silly question that is stopping me from running my spider. Every time I run it, I get an IndentationError for the last "}" at the end of my spider code, after the final "yield", and I cannot figure out why. Can someone help me with this? Thanks a lot!
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import json
import logging
import urlparse


class ArtsPodcastsSpider(scrapy.Spider):
    name = 'arts_podcasts'
    allowed_domains = ['www.castbox.fm']

    def start_requests(self):
        try:
            if response.request.meta['skip']:
                skip = response.request.meta['skip']
            else:
                skip = 0
            while skip < 201:
                url = 'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1'
                split_url = urlparse.urlsplit(url)
                path = split_url.path
                path.split('&')
                path.split('&')[:-5]
                '&'.join(path.split('&')[:-5])
                parsed_query = urlparse.parse_qs(split_url.query)
                query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
                query['skip'] = skip
                updated = split_url._replace(path='&'.join(base_path.split('&')[:-5] + ['limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1', '']),
                                             query=urllib.urlencode(query, doseq=True))
                updated_url = urlparse.urlunsplit(updated)
                yield scrapy.Request(url=updated_url, callback=self.parse_id, meta={'skip': skip})

    def parse_id(self, response):
        skip = response.request.meta['skip']
        data = json.loads(response.body)
        category = data.get('data').get('category').get('name')
        arts_podcasts = data.get('data').get('list')
        for arts_podcast in arts_podcasts:
            yield scrapy.Request(url='https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={0}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'.format(arts_podcast.get('list')[2].get('cid')), meta={'category': category, 'skip': skip}, callback=self.parse)

    def parse(self, response):
        skip = response.request.meta['skip']
        category = response.request.meta['category']
        arts_podcast = json.loads(response.body).get('data')
        yield scrapy.Request(callback=self.start_requests, meta={'skip': skip + 1})
        yield {
            'title': arts_podcast.get('title'),
            'category': arts_podcast.get('category'),
            'sub_category': arts_podcast.get('categories')
        }
Thank you!
The error is that you have a try without a matching except or finally.
I would expect this to result in a SyntaxError, but I'm guessing Python detects that you're back to the original indentation of the try statement before it figures out there is no matching except/finally.
There are other errors too, such as accessing a nonexistent response in start_requests, and the parse methods' indentation being wrong...
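A minimal sketch of one way the start_requests skeleton could be restructured so that the try has a matching except and the nonexistent response is never referenced; build_url and the skip increment of 60 are illustrative assumptions, not part of the original code:

def start_requests(self):
    skip = 0  # start_requests receives no response, so start from a constant
    while skip < 201:
        try:
            # rebuild the URL for the current skip value here,
            # as in the original urlsplit/urlunsplit code
            updated_url = self.build_url(skip)  # hypothetical helper
        except Exception:
            self.logger.exception("could not build URL for skip=%s", skip)
            return
        yield scrapy.Request(url=updated_url, callback=self.parse_id, meta={'skip': skip})
        skip += 60  # assumption: advance by the page size (limit=60)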
I'm using Scrapy to crawl a website. The first call seems OK and collects some data. For every subsequent request I need some information from another request. To simplify the program, I separated the different requests into separate method calls. But it seems that Scrapy does not run these method calls as I expect: none of the sub-calls gets executed.
I have already tried a few different things:
Called an instance method with self.sendQueryHash(response, tagName, afterHash)
Called a static method with sendQueryHash(response, tagName, afterHash) and changed the indent
Removed the method call, and it worked. I saw the sendQueryHash output in the logger.
import scrapy
import re
import json
import logging
import time


class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']
    tags = [
        "this",
        "that"]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                url,
                meta={'cookiejar': i},
                callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text

        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)

        yield {
            'json': jsonData,
            'requestTime': int(round(time.time() * 1000)),
            'requestNumber': 0
        }

        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available, stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here

    ## 3.
    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"
I expect the sendQueryHash output to be shown, but it never happens. Things only work when I comment out the self.sendQueryHash call and the def sendQueryHash block.
That's only one example of the behavior I don't expect.
self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here
will just create a generator that you do nothing with. You need to make sure you yield your Request back to the Scrapy engine. Since only a single Request is returned, you should be able to use return instead of yield inside sendQueryHash, and then yield the Request directly by replacing the above line with
yield self.sendQueryHash(response, tagName, afterHash)
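A minimal sketch of that change, using the method from the question; the only differences are return request instead of yield request, and the call site yielding the returned Request:

def sendQueryHash(self, response, tagName, afterHash):
    self.logger.info("sendQueryHash")
    request = scrapy.Request(
        "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parseQueryHash,
        dont_filter=True)
    request.cb_kwargs['tagName'] = tagName
    request.cb_kwargs['afterHash'] = afterHash
    return request  # return, not yield, so this is a plain method rather than a generator

and, inside parsefirstAccess:

# yield the Request object that sendQueryHash now returns
yield self.sendQueryHash(response, tagName, afterHash)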
This is the first time I've asked a question here, so please forgive me if I get something wrong.
I've only been learning Python for a month, and I'm trying to use Scrapy to learn more about spiders.
My question is here:
def get_chapterurl(self, response):
    item = DingdianItem()
    item['name'] = str(response.meta['name']).replace('\xa0', '')
    yield item
    yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id})

def get_chapter(self, response):
    urls = re.findall(r'<td class="L">(.*?)</td>', response.text)
As you can see, I yield an item and a Request at the same time, but the get_chapter function never runs its first line (I set a breakpoint there), so where did I go wrong?
Sorry for disturbing you.
I have googled for a while, but got nothing...
Your request gets filtered out.
Scrapy has a built-in request filter that prevents you from downloading the same page twice (an intended feature).
Let's say you are on http://example.com; this request you yield:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
tries to download http://example.com again. And if you look at the crawling log it should say something along the lines of "ignoring duplicate url http://example.com".
You can always bypass this feature by setting the dont_filter=True parameter on your Request object, like so:
yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id},
              dont_filter=True)
However! I'm having trouble understanding the intention of your code, but it seems that you don't really want to download the same URL twice.
You don't have to schedule a new request either; you can just call your callback with the response you already have:
response = response.replace(meta={'name': name_id})  # update meta
# why crawl it again, if we can just call the callback directly!
# for python2:
for result in self.get_chapter(response):
    yield result
# or if you are running python3:
yield from self.get_chapter(response)
import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'hotelspider'
    start_urls = [
        'https://tr.hotels.com/search/listings.json?destination-id=1648683&q-check-out=2016-10-22&q-destination=Didim,+T%C3%BCrkiye&q-room-0-adults=2&pg=2&q-rooms=1&start-index=7&q-check-in=2016-10-21&resolved-location=CITY:1648683:UNKNOWN:UNKNOWN&q-room-0-children=0&pn=1'
    ]

    def parse(self, response):
        myresponse = json.loads(response.body)
        data = myresponse.get('data')
        body = data.get('body')
        searchresults = body.get('searchResults')
        for item in searchresults.get('results', []):
            yield {
                'text': item[0]['altText']
            }
This is the screenshot of the error. I always get an error when I run this script. Can anybody tell me where I am going wrong?
I can't seem to reproduce your error, but upon copying your code I got a KeyError that pertains to your yield statement. See the code below:
import scrapy
import json


class SpidyQuotesSpider(scrapy.Spider):
    name = "hotelspider"
    allowed_domains = ["tr.hotels.com"]
    start_urls = (
        'https://tr.hotels.com/search/listings.json?destination-id=1648683&q-check-out=2016-10-22&q-destination=Didim,+T%C3%BCrkiye&q-room-0-adults=2&pg=2&q-rooms=1&start-index=7&q-check-in=2016-10-21&resolved-location=CITY:1648683:UNKNOWN:UNKNOWN&q-room-0-children=0&pn=1',
    )

    def parse(self, response):
        myresponse = json.loads(response.body)
        data = myresponse.get('data')
        body = data.get('body')
        searchresults = body.get('searchResults')
        for item in searchresults.get('results', []):
            yield {
                'text': item['altText']
            }
Make sure you are indenting using the same number of spaces, or just use tabs. The indentation shown in your code seems fine, though. Try pasting mine and see what comes up.
You are mixing space and tab characters in your spider code (I copied your code from the "edit" view of your question).
Quoting Wikipedia, "Python uses whitespace to delimit control flow blocks". Indentation is crucial, and you need to stick to either spaces or tabs; mixing the two will lead to these IndentationErrors.
Try to make the whitespace consistent throughout.
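For example, a sketch of the parse method from the question re-indented with four spaces at every level (only the whitespace differs from the original):

def parse(self, response):
    myresponse = json.loads(response.body)
    data = myresponse.get('data')
    body = data.get('body')
    searchresults = body.get('searchResults')
    for item in searchresults.get('results', []):
        yield {
            'text': item[0]['altText']
        }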
I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there are some category-list links to be used to build the second wave of links.
The second round of links are usually the first page of each category. Different pages inside a category follow the same regular-expression patterns, wholesale/something/something/request or wholesale/pagenumber, and I want to follow those patterns to keep crawling while storing the raw HTML in my item object.
I tested these two steps separately using the parse command, and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested the built outlink again.
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
And it seems the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw that it crawled the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even though the URLs I build match the second rule, they are not passed to parse_pricing unless the callback is set explicitly. Hope this is helpful to other people.