Scrapy Deploy Doesn't Match Debug Result - python

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, which contains some category-list links that are used to build the second wave of links.
The second round of links are usually the first page of each category. Different pages inside a category follow the same regular expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and, at the same time, store the raw HTML in my item object.
I tested these two steps separately with the parse command and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
I can see it built the outlinks successfully. Then I tested one of the built outlinks:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw it crawled the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# MyprojectItem is defined in the project's items module

class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except Exception:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except Exception:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!

I was assuming the new Request objects I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I see the callback is handled in a different way:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I build match the second rule, they won't be passed to parse_pricing unless the callback is set explicitly on the Request. Hope this is helpful to other people.
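For completeness, a minimal sketch of a corrected parse_category under that assumption (myurl1 and myurl2 stand for whatever URLs the original code builds from the parsed page):
    def parse_category(self, response):
        soup = BeautifulSoup(response.body)
        # ... build myurl1 and myurl2 from the parsed page ...
        # Requests yielded from a callback are not matched against the
        # CrawlSpider rules, so the callback has to be set explicitly:
        yield Request(url=myurl1, callback=self.parse_pricing)
        yield Request(url=myurl2, callback=self.parse_pricing)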

Related

Scrapy - decorator for responses, can it access data 'inside' the method?

We are using Scrapy. We have a decorator for logging Scrapy responses in utils/__init__.py and it prints what it finds, which is OK. But we would also like to know "how many links it found on the page". As a result we currently have 2 log statements, producing 2 lines:
200: page found XXX
Found 23 products on category page XXX
Instead we would like to have 1 log statement, preferably somewhere central and not in every crawler (we have a lot!), that prints:
200: Page found, with # products - XXX
I don't think log_response is able to access data 'inside' the method, because that happens later? Or is there a way to achieve this where we have 1 central method like log_response that can also access the number of links found, so we can remove all the "Found 23 products on category page XXX" lines in the individual crawlers?
Question: how can we centralize this and make it more generic, so there is no logging logic in the crawler class but somewhere else / more central?
# decorator for logging Scrapy responses (lives in utils/__init__.py)
from functools import wraps
from urllib.parse import urlparse

def log_response(title, with_meta=False):
    def real_decorator(f):
        @wraps(f)
        def wrap(self, response):
            if not with_meta:
                path = urlparse(response.url).path.strip('/')
                self.logger.info(f'200 {title}: {path}')
            return f(self, response)
        return wrap
    return real_decorator
This is how we currently report the number of links found in the individual crawlers:
@log_response('category')
def parse_category(self, response):
    product_links = response.xpath('//a[@class="mainLink"]/@href').getall()
    self.logger.info(f'Found {len(product_links)} products on category page (url {response.url})')
The simplest way is probably doing your logging in a SpiderMiddleware's process_spider_output method, since it will be called every time a spider callback finishes.
Simply iterate over result, count the items, and make a logging call once your loop is over.
import scrapy

class LoggingMiddleware:
    def process_spider_output(self, response, result, spider):
        count = 0
        for x in result:
            yield x
            # I think this is a sufficient check?
            if not isinstance(x, scrapy.Request):
                count += 1
        spider.logger.info(f'{response.status}: Page found, with {count} products - {response.url}')
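For the middleware to actually run, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py (adjust the dotted path and priority to your project):
# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.LoggingMiddleware': 800,
}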

Item serializers don't work. Function never gets called

I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider runs without any errors, but the serialization doesn't happen, and the print in the function doesn't print either. It's like the function remove_pound is never called.
import scrapy

def remove_pound(value):
    print('Am I a joke to you?')
    return value.replace('£', '')

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(serializer=remove_pound)

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//ol/li')
        for i in books:
            yield BookItem(
                title=i.xpath('article/h3/a/text()').get(),
                price=i.xpath('article/div/p[@class="price_color"]/text()').get(),
            )
Am I using it wrong?
PS.: I know there are other ways to do it, I just want to learn to use this way.
At first I thought the only reason it didn't work was that your XPath expression was not right and you needed a relative XPath:
price=i.xpath('./article/div/p[@class="price_color"]/text()').get()
Update: it's not the XPath. Field serialization is applied only by item exporters:
you can customize how each field value is serialized before it is
passed to the serialization library.
So if you run this command scrapy crawl bookspider -o BookSpider.csv you'll get a correct (serialized) output.
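A minimal sketch that shows the serializer firing, by driving Scrapy's CsvItemExporter directly (the item values and file name are just illustrative):
from scrapy.exporters import CsvItemExporter

item = BookItem(title='Some book', price='£51.77')

with open('books.csv', 'wb') as f:
    exporter = CsvItemExporter(f)
    exporter.start_exporting()
    exporter.export_item(item)  # remove_pound() is called here, by the exporter
    exporter.finish_exporting()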

No internal method call with scrapy

I'm using Scrapy to crawl a website. The first call seems OK and collects some data. For every subsequent request I need some information from another request. To simplify the code, I separated the different requests into different method calls. But it seems that Scrapy never executes these sub-calls.
I have already tried a few different things:
Called an instance method with self.sendQueryHash(response, tagName, afterHash)
Called a static method with sendQueryHash(response, tagName, afterHash) and changed the indentation
Removed the method call and inlined the code, and it worked: I saw the sendQueryHash output in the logger.
import scrapy
import re
import json
import time
import logging

class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']
    tags = [
        "this",
        "that"]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                url,
                meta={'cookiejar': i},
                callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text

        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)

        yield {
            'json': jsonData,
            'requestTime': int(round(time.time() * 1000)),
            'requestNumber': 0
        }

        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available, stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here

    ## 3.
    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"
I expect the sendQueryHash output to be shown, but it never happens; it only works when I comment out the self.sendQueryHash call and the def sendQueryHash lines.
That's only one example of the behavior I don't expect.
self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here
will just create a generator that you do nothing with. You need to make sure you yield your Request back to the Scrapy engine. Since only a single request is returned, you should be able to use return instead of yield inside sendQueryHash, and then yield the Request directly by replacing the line above with
yield self.sendQueryHash(response, tagName, afterHash)
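Concretely, a minimal sketch of the two changes the answer describes (everything else stays as in the original spider):
    def parsefirstAccess(self, response):
        # ... same as before, up to the hasNext check ...
        if not hasNext:
            self.logger.info("hasNext is false")
            return
        self.logger.info("hasNext is true")
        # yield the Request built by sendQueryHash back to the engine
        yield self.sendQueryHash(response, tagName, afterHash)

    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        return request  # return instead of yield, so sendQueryHash is no longer a generator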

python yield function with callback args

This is the first time I've asked a question here, so please forgive me if I get something wrong.
I have been using Python for about a month, and I'm trying to use Scrapy to learn more about spiders.
Here is the question:
    def get_chapterurl(self, response):
        item = DingdianItem()
        item['name'] = str(response.meta['name']).replace('\xa0', '')
        yield item
        yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id})

    def get_chapter(self, response):
        urls = re.findall(r'<td class="L">(.*?)</td>', response.text)
As you can see, I yield an item and a Request at the same time, but the get_chapter function never runs its first line (I set a breakpoint there). So where did I go wrong?
Sorry for disturbing you.
I have googled for a while, but got nothing...
Your request gets filtered out.
Scrapy has an in-built request filter that prevents you from downloading the same page twice (this is an intended feature).
Let's say you are on http://example.com; the request you yield:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
tries to download http://example.com again. And if you look at the crawling log it should say something along the lines of "ignoring duplicate url http://example.com".
You can always bypass this feature by setting the dont_filter=True parameter on your Request object, like so:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id},
dont_filter=True)
However, I'm having trouble understanding the intention of your code, but it seems that you don't really want to download the same URL twice.
You don't have to schedule a new request either; you can just call your callback with the response you already have:
response.meta['name'] = name_id  # update meta in place (response.meta is a shortcut to response.request.meta)
# why crawl it again, if we can just call the callback directly!
# for python2
for result in self.get_chapter(response):
    yield result
# or if you are running python3:
yield from self.get_chapter(response)
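Putting it together, a minimal sketch of get_chapterurl using the direct call (Python 3):
    def get_chapterurl(self, response):
        item = DingdianItem()
        item['name'] = str(response.meta['name']).replace('\xa0', '')
        yield item
        # reuse the response we already have instead of scheduling it again
        response.meta['name'] = name_id
        yield from self.get_chapter(response)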

How can I write my custom link extractor in Scrapy (Python)

I want to write my own custom Scrapy link extractor for extracting links.
The Scrapy documentation says it has two built-in extractors:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
But I haven't seen any code example of how to implement a custom link extractor. Can someone give an example of writing a custom extractor?
This is an example of a custom link extractor:
class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performant link extractor"""
    # Note: linkre and the helper functions used below (clean_link, remove_entities,
    # remove_tags, replace_escape_chars) are imported/defined elsewhere in the linked project
    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]
Usage
rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            # [a-z]{2} - matches a two character state abbreviation
            # [a-z]* - matches a state name
            # [0-9]{4} - matches a 4 number unique webpage identifier
            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default
        process_links='processLinks',
        process_request='processRequest',
    ),
)
Have a look at https://github.com/jtfairbank/RCP-Poll-Scraper for the full project.
I had a hard time finding recent examples of this, so I decided to post my walkthrough of the process of writing a custom link extractor.
The reason why I decided to create a custom link extractor
I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:
<a href="
/something/something.html
" />
Supposing the page that had this link was at:
http://example.com/something/page.html
Instead of transforming this href url into:
http://example.com/something/something.html
Scrapy transformed it into:
http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20
And this was causing an infinite loop, as the crawler would go deeper and deeper on those badly interpreted urls.
I tried to use the process_value and process_links params of LxmlLinkExtractor, as suggested here, without luck, so I decided to patch the method that processes relative URLs.
Finding the original code
At the current version of Scrapy (1.0.3), the recommended link extractor is the LxmlLinkExtractor.
If you want to extend LxmlLinkExtractor, you should check out how the code goes on the Scrapy version that you are using.
You can probably open the location of the Scrapy code you are currently using by running this from the command line (on OS X):
open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
In the version that I use (1.0.3) the code of LxmlLinkExtractor is in:
scrapy/linkextractors/lxmlhtml.py
There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.
So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The original version of the single line I modified is left in as a comment.
# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *

_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):
    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):
            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()
            attr_val = urljoin(base_url, attr_val.strip())
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                        nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links

# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
                                       unique=unique, process=process_value)
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                                                allow_domains=allow_domains, deny_domains=deny_domains,
                                                restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                                                canonicalize=canonicalize, deny_extensions=deny_extensions)
And when defining the rules, I use CustomLinkExtractor:
from scrapy.spiders import Rule

rules = (
    Rule(CustomLinkExtractor(canonicalize=False, allow=[('^https?\:\/\/example\.com\/something\/.*'),]), callback='parse_item', follow=True),
)
I've also found LinkExtractor examples at
https://github.com/geekan/scrapy-examples
and
https://github.com/mjhea0/Scrapy-Samples
(Edited after people could not find the required info at the links above.) More precisely, see https://github.com/geekan/scrapy-examples/search?utf8=%E2%9C%93&q=linkextractors&type=Code and https://github.com/mjhea0/Scrapy-Samples/search?utf8=%E2%9C%93&q=linkextractors
