This is the first time I'm asking a question here, so please forgive me if I get something wrong.
I have been learning Python for about a month, and I am trying to use Scrapy to learn more about writing spiders.
Here is my question:
def get_chapterurl(self, response):
    item = DingdianItem()
    item['name'] = str(response.meta['name']).replace('\xa0', '')
    yield item
    yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id})

def get_chapter(self, response):
    urls = re.findall(r'<td class="L">(.*?)</td>', response.text)
As you can see, I yield an item and a Request at the same time, but the get_chapter function never runs its first line (I put a breakpoint there). Where did I go wrong?
Sorry for disturbing you.
I have been googling for a while, but got nothing...
Your request gets filtered out.
Scrapy has a built-in request filter that prevents you from downloading the same page twice (an intended feature).
Let's say you are on http://example.com; this request that you yield:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
tries to download http://example.com again. If you look at the crawl log, it should say something along the lines of "ignoring duplicate url http://example.com".
You can always bypass this feature by setting the dont_filter=True parameter on your Request object, like so:
yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id},
              dont_filter=True)
However! I'm having trouble understanding the intention of your code, but it seems that you don't really want to download the same URL twice.
You don't have to schedule a new request either; you can just call your callback with the response you already have:
response.meta['name'] = name_id  # update the meta on the underlying request

# why crawl it again, if we can just call the callback directly!
# for python2:
for result in self.get_chapter(response):
    yield result

# or if you are running python3:
yield from self.get_chapter(response)
Related
In Scrapy 2.4.x on Python 3.8.x, I am yielding an item with the purpose of saving some stats to a DB. The scraper has another item that gets yielded as well.
While the class name of the item, "StatsItem", is present in the main script, it is lost within the other class. I am using the class name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class demo(scrapy.Spider):
    def parse_item(self, response):
        stats = StatsItem()
        stats['results'] = 10
        yield stats
        print(type(stats).__name__)
        # Output: StatsItem
        print(stats)
        # Output: {'results': 10}
in pipeline.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class mysql_pipeline(object):
    def process_item(self, item, spider):
        print(type(item).__name__)
        # Output: NoneType
        if isinstance(item, StatsItem):
            self.save_stats(item, spider)
        elif isinstance(item, OtherItem):
            pass  # call other method
        return item
The output of print in the first class is "StatsItem", while it is "NoneType" within the pipeline, so the save_stats() method never gets called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception I am aware of. Any help is greatly appreciated.
You can't use yield outside of a function imo.
I was finally able to locate the problem. The particular crawler was nearly identical to all the other ones that did not have this issue, with one exception: I was setting the item pipeline via custom_settings:
custom_settings.update({
    'ITEM_PIPELINES': {
        'crawler.pipelines.mysql_pipeline': 301,
    }
})
Removing this fixed the issue.
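For comparison, here is a minimal sketch of enabling the same pipeline project-wide in settings.py rather than through the spider's custom_settings (the dotted path is just reused from the snippet above):
# settings.py -- sketch, reusing the pipeline path from the question
ITEM_PIPELINES = {
    'crawler.pipelines.mysql_pipeline': 301,  # lower numbers run earlier in the pipeline chain
}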
Using Scrapy. We have a decorator for logging Scrapy responses in utils/__init__.py and it prints what it finds, which is OK. We would also like to know how many links it found on the page. As a result we currently have 2 log statements, producing 2 lines:
200: page found XXX
Found 23 products on category page XXX
Instead we would like to have 1 log statement, preferably somewhere central and not in every crawler (we have a lot!), that prints:
200: Page found, with # products - XXX
I don't think log_response is able to access data 'inside' the method, because that happens later? Or is there a way to achieve this where we have 1 central method like log_response that can also access the number of links found, so we can remove all the "Found 23 products on category page XXX" lines from the individual crawlers?
Question: how can we centralize this and make it more generic, so there is no logging logic in the crawler class but somewhere else / more central?
from functools import wraps
from urllib.parse import urlparse

# decorator for logging Scrapy responses
def log_response(title, with_meta=False):
    def real_decorator(f):
        @wraps(f)
        def wrap(self, response):
            if not with_meta:
                path = urlparse(response.url).path.strip('/')
                self.logger.info(f'200 {title}: {path}')
            return f(self, response)
        return wrap
    return real_decorator
This is how we currently report the number of links found in each crawler:
@log_response('category')
def parse_category(self, response):
    product_links = response.xpath('//a[@class="mainLink"]/@href').getall()
    self.logger.info(f'Found {len(product_links)} products on category page (url {response.url})')
The simplest way is probably doing your logging in a SpiderMiddleware's process_spider_output method, since it will be called every time a spider callback finishes.
Simply iterate over result, count the items, and make a logging call once your loop is over.
import scrapy

class LoggingMiddleware:
    def process_spider_output(self, response, result, spider):
        count = 0
        for x in result:
            yield x
            # I think this is a sufficient check?
            if not isinstance(x, scrapy.Request):
                count += 1
        spider.logger.info(f'{response.status}: Page found, with {count} products - {response.url}')
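To actually get this middleware to run, it also needs to be enabled in the project settings. A minimal sketch, assuming the class lives in a module such as crawler.middlewares (that path is an assumption, adjust it to your project layout):
# settings.py -- sketch; 'crawler.middlewares' is an assumed module path
SPIDER_MIDDLEWARES = {
    'crawler.middlewares.LoggingMiddleware': 543,  # any unused priority slot works
}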
My issue is that when I added the redirect code from "Can't get Scrapy to parse and follow 301, 302 redirects" to my script, it solved the problem in the sense that it now runs without errors, but I'm no longer getting any output to my CSV file. The problem is that in parse_links1, the if and else branches end with a 'yield' statement, and this seems to prevent the scrapy.Request line from running. This is fairly clear, since the previous iteration of this code, which only went down 2 levels of links, ran perfectly. But since the latest level has a redirect issue, I had to add that code in.
My code is like this:
class TurboSpider(scrapy.Spider):
    name = "fourtier"
    handle_httpstatus_list = [404]
    start_urls = [
        "https://ttlc.intuit.com/browse/cd-download-support"]

    # def parse gets first set of links to use
    def parse(self, response):
        links = response.selector.xpath('//ul[contains(@class, "list-unstyled")]//@href').extract()
        for link in links:
            yield scrapy.Request(link, self.parse_links, dont_filter=True)

    def parse_links(self, response):
        tier2_text = response.selector.xpath('//a[contains(@class, "dropdown-item-link")]//@href').extract()
        for link in tier2_text:
            schema = 'https://turbotax.intuit.com/'
            links_to_use = urlparse.urljoin(schema, link)
            yield scrapy.Request(links_to_use, self.parse_links1)

    def parse_links1(self, response):
        tier2A_text = response.selector.xpath('//a').extract()
        for t in tier2A_text:
            if response.status >= 300 and response.status < 400:
                # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
                location = to_native_str(response.headers['location'].decode('latin1'))
                request = response.request
                redirected_url = urljoin(request.url, location)
                if response.status in (301, 307) or request.method == 'HEAD':
                    redirected = request.replace(url=redirected_url)
                    yield redirected
                else:
                    redirected = request.replace(url=redirected_url, method='GET', body='')
                    redirected.headers.pop('Content-Type', None)
                    redirected.headers.pop('Content-Length', None)
                    yield redirected
            yield scrapy.Request((t, self.parse_links2))

    def parse_links2(self, response):
        divs = response.selector.xpath('//div')
        for p in divs.select('.//p'):
            yield {'text': p.extract()}
What is wrong with the way I've set up the 'yield' statements in the parse_links1 function, such that I now don't get any output? How do I integrate several 'yield' statements together?
See Debugging Spiders.
Some logging statements should allow you to determine where something unexpected is happening (execution not reaching a certain line, some variable containing unexpected data), which in turn should help you either understanding what the issue is or writing a more specific question that is easier to answer.
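As a sketch of what that debugging could look like here (the log messages and their placement are just an illustration, not a fix), you could drop a few logger calls into parse_links1 to confirm which branch runs and what each loop variable actually contains:
# Sketch only: extra logging inside parse_links1; the redirect/request code stays as in the question.
def parse_links1(self, response):
    tier2A_text = response.selector.xpath('//a').extract()
    self.logger.info(f'parse_links1: status={response.status}, anchors found={len(tier2A_text)}')
    for t in tier2A_text:
        if 300 <= response.status < 400:
            self.logger.info(f'parse_links1: redirect branch taken for {response.url}')
            ...  # redirect handling from the question goes here
        self.logger.info(f'parse_links1: about to yield a request for {t!r}')
        ...  # the scrapy.Request yield from the question goes here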
I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage; there are some category-list links that are used to build the second wave of links.
For the second round of links, they are usually the first page of each category. The different pages inside each category follow the same regular-expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and, meanwhile, store the raw HTML in my item object.
I tested these two steps separately using the parse command and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
I can see it builds the outlinks successfully. Then I tested one of the built outlinks on its own:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems like the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw it crawl the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestions!
I was assuming the new Request objects that I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, the callback is handled in a different way:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I build match the second rule, they won't be passed to parse_pricing unless the callback is set explicitly. Hope this is helpful to other people.
The following Django middleware function is used to identify active links in a Django response object. If a link is active, it is marked with a CSS class and its href attribute gets replaced by javascript:void(null);. Using this function, the last two lines before the return statement are so slow that I can't use it; furthermore, no CSS, JS or images are rendered. However, if I put these two calls inside the for loop, everything is fine and fast.

But I don't want these two calls executed for each active link on the page; I want them executed only once, and that doesn't work. I really can't see why, or what the for loop has to do with it. It's not a BeautifulSoup issue, because it's the same with re.sub('\s+', '', response.content) or the replace function. As far as I have investigated this, I can tell you that the very last line before the return statement is the slow one, as long as it's not executed inside the for loop. I would really appreciate a possible explanation.
import re
from django_projects.projects.my_project.settings import SITE_NAME
from BeautifulSoup import BeautifulSoup

class PostRender():
    def process_response(self, request, response):
        link_pattern = re.compile('<a.*href="(http://%s)*%s".*>' % (SITE_NAME, request.path), re.IGNORECASE)
        klass_pattern = re.compile('class="[^"]*"', re.IGNORECASE)
        href_pattern = re.compile(r'href="(http://%s)*%s(\?.*)*"' % (SITE_NAME, request.path), re.IGNORECASE)

        # find all active links
        search = re.finditer(link_pattern, response.content)
        for res in search:
            result = res.group()
            klassname = 'class="active"'
            if 'class' in result:
                klass = re.search(klass_pattern, result).group().split('=')[1]
                if len(klass) != 0:
                    klassname = 'class="%s %s"' % (klass[1:-1], 'active')
            link = re.sub(href_pattern, 'href="javascript:void(null);"', re.sub(klass_pattern, klassname, result))
            response.content = re.sub(result, link, response.content)

        soup = BeautifulSoup(response.content)
        response.content = soup.prettify()
        return response