I want to write a custom Scrapy link extractor for extracting links.
The scrapy documentation says it has two built-in extractors.
http://doc.scrapy.org/en/latest/topics/link-extractors.html
But I haven't seen any code example of how to implement my own custom link extractor. Can someone give an example of writing a custom extractor?
This is an example of a custom link extractor:
class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performant link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        # linkre is a module-level compiled regex (defined elsewhere in the
        # original project) that matches href/text pairs in anchor tags
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]
Usage
rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #   [a-z]{2} - matches a two-character state abbreviation
            #   [a-z]+   - matches a state name
            #   [0-9]{4} - matches a 4-digit unique webpage identifier
            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default
        process_links='processLinks',
        process_request='processRequest',
    ),
)
Have a look here: https://github.com/jtfairbank/RCP-Poll-Scraper
I had a hard time finding recent examples of this, so I decided to post my walkthrough of the process of writing a custom link extractor.
The reason why I decided to create a custom link extractor
I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:
<a href="
/something/something.html
" />
Supposing the page that had this link was at:
http://example.com/something/page.html
Instead of transforming this href url into:
http://example.com/something/something.html
Scrapy transformed it into:
http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20
And this was causing an infinite loop, as the crawler would go deeper and deeper on those badly interpreted urls.
I tried to use the process_value and process_links params of LxmlLinkExtractor, as suggested here, without luck, so I decided to patch the method that processes relative URLs.
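For reference, this is roughly the kind of process_value workaround I tried first; it is only a sketch (the allow pattern is a placeholder), and in my case it was not enough because the whitespace ends up in the middle of the already-joined URL:

from scrapy.linkextractors import LxmlLinkExtractor

# strip spaces, tabs and line breaks from every extracted href value
link_extractor = LxmlLinkExtractor(
    allow=[r'/something/'],
    process_value=lambda value: value.strip() if value else value,
)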
Finding the original code
In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.
If you want to extend LxmlLinkExtractor, you should check how the code looks in the Scrapy version you are using.
You can probably open the location of your currently installed Scrapy code by running, from the command line (on OS X):
open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
In the version that I use (1.0.3) the code of LxmlLinkExtractor is in:
scrapy/linkextractors/lxmlhtml.py
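If you prefer, you can also ask Python directly for the file that defines these classes; a small helper (not from the original post, assuming Scrapy is importable in your environment):

import scrapy.linkextractors.lxmlhtml as lxmlhtml

# prints the full path of the installed lxmlhtml module (may end in .pyc)
print(lxmlhtml.__file__)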
There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.
So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The original version of the single line I modified is left commented out above its replacement.
# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *

_collect_string_content = etree.XPath("string()")


# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):
            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()
            attr_val = urljoin(base_url, attr_val.strip())
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                        nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
                                       unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                                                allow_domains=allow_domains, deny_domains=deny_domains,
                                                restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                                                canonicalize=canonicalize, deny_extensions=deny_extensions)
And when defining the rules, I use CustomLinkExtractor:
from scrapy.spiders import Rule

rules = (
    Rule(CustomLinkExtractor(canonicalize=False,
                             allow=[r'^https?\:\/\/example\.com\/something\/.*']),
         callback='parse_item', follow=True),
)
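For completeness, here is a minimal sketch (spider name, domain and URLs are placeholders, not from my original project) of where those rules would sit inside a CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/something/page.html']

    rules = (
        Rule(CustomLinkExtractor(canonicalize=False,
                                 allow=[r'^https?\:\/\/example\.com\/something\/.*']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder callback: just record the cleaned-up URL
        yield {'url': response.url}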
I've also found LinkExtractor examples at
https://github.com/geekan/scrapy-examples
and
https://github.com/mjhea0/Scrapy-Samples
(edited after people could not find the required info at the links above)
more precisely at https://github.com/geekan/scrapy-examples/search?utf8=%E2%9C%93&q=linkextractors&type=Code and https://github.com/mjhea0/Scrapy-Samples/search?utf8=%E2%9C%93&q=linkextractors
Related
What I'm trying to do is have it replace all URLs in an HTML file.
This is what I have done, but I realized it also deletes everything after it.
s = 'https://12345678.com/'
site_link = "google"
print(s[:8] + site_link)
It would return https://google.
Your slice s[:8] keeps only 'https://' and throws away the rest of the template, which is why you end up with https://google. I have made a code sample instead.
In it, link_template is a template for a link, and ***** marks where your site_name will go. It might look a bit confusing at first, but if you run it you'll understand.
# change this to change your URL
link_template = 'https://*****.com/'
# a site name, from your example
site_name = 'google'
# this is your completed link
site_link = site_name.join(link_template.split('*****'))
# prints the result
print(site_link)
Additionally, you can make a function for it:
def name_to_link(link_template, site_name):
    return site_name.join(link_template.split('*****'))
And then you can use the function like this:
link = name_to_link('https://translate.*****.com/','google')
print(link)
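As a side note, str.replace does the same substitution in a single call, if you prefer; a small alternative sketch:

def name_to_link_replace(link_template, site_name):
    # swap the ***** placeholder for the site name directly
    return link_template.replace('*****', site_name)

print(name_to_link_replace('https://translate.*****.com/', 'google'))
# prints: https://translate.google.com/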
I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider works without any errors, but the serialization doesn't happen, and the print in the function doesn't print either. It's as if the function remove_pound is never called.
import scrapy


def remove_pound(value):
    print('Am I a joke to you?')
    return value.replace('£', '')


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(serializer=remove_pound)


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//ol/li')
        for i in books:
            yield BookItem(
                title=i.xpath('article/h3/a/text()').get(),
                price=i.xpath('article/div/p[@class="price_color"]/text()').get(),
            )
Am I using it wrong?
PS: I know there are other ways to do it; I just want to learn how to use this one.
The only reason it doesn't work is that your XPath expression is not right. You need to use a relative XPath:
price=i.xpath('./article/div/p[@class="price_color"]/text()').get()
Update: it's not the XPath. The serialization is applied only by item exporters:
you can customize how each field value is serialized before it is
passed to the serialization library.
So if you run the command scrapy crawl bookspider -o BookSpider.csv, you'll get correct (serialized) output.
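If you want to see the serializer fire outside of a full crawl, here is a minimal sketch (the item values are made up for illustration) that feeds a BookItem from the question straight to Scrapy's CSV exporter, which is the component that applies field serializers:

from scrapy.exporters import CsvItemExporter

item = BookItem(title='Some Book', price=u'£10.00')  # made-up values

with open('books.csv', 'wb') as f:
    exporter = CsvItemExporter(f)
    exporter.start_exporting()
    exporter.export_item(item)   # remove_pound() is called here
    exporter.finish_exporting()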
What am I doing wrong with this script that it's not outputting a CSV file with the data? I am running it with scrapy runspider yellowpages.py -o items.csv, and nothing comes out but a blank CSV file. I have followed various guides here and also watched YouTube videos trying to figure out where I am making the mistake, and I still cannot figure out what I am doing wrong.
# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()


class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)
A simple spider, without a project.
Use my code; I wrote comments to make it easier to understand. This spider looks for all blocks on all pages for a pair of parameters, "service" and "location". To run it in your case, use:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code will also work with any other query. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any pair of "servise" + "location" in our request
    def __init__(self, servise=None, location=None):
        self.servise = servise
        self.location = location

    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Build the search URL using "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send a request to "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All result blocks, without AD posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }

        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If there is a next page URL
        if next_page:
            # Send a request to "yellowpages.com" + "next_page", then call parse_result again
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)
In this part:
for item in items:
    print(item)
put yield instead of print, so the feed exporter actually receives the items (print only writes them to stdout):
for item in items:
    yield item
On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to reflect the name you want on your crawler so you can use it like so: scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
When you try to extract the data from the CSS selector, you're forgetting to call extract(), which actually turns your selectors into string data you can use.
Also, you shouldn't be printing to the standard output stream, because a lot of logging goes there and it will make your output really messy. Instead, you should extract the responses into items (for example using loaders) and yield them; see the sketch after the settings below.
Finally, you're probably missing the appropriate settings from your settings.py file. You can find the relevant documentation here.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
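Putting those points together, a rough sketch of what the spider could look like (the CSS selector comes from the question; the start URL and the output field name are just examples):

import scrapy

class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA']

    def parse(self, response):
        self.log("I just visited: " + response.url)
        for href in response.css('a[class=business-name]::attr(href)').extract():
            # yield items so the feed exporter can write them to the output file
            yield {'business_url': response.urljoin(href)}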
I am using scrapy to crawl several websites. My spider isn't allowed to jump across domains. In this scenario, redirects make the crawler stop immediately. In most cases I know how to handle it, but this is a weird one.
The culprit is: http://www.cantonsd.org/
I checked its redirect pattern with http://www.wheregoes.com/ and it tells me it redirects to "/". This prevents the spider from entering its parse function. How can I handle this?
EDIT:
The code.
I invoke the spider using the APIs provided by scrapy here: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
The only difference is that my spider is custom. It is created as follows:
spider = DomainSimpleSpider(
    start_urls = [start_url],
    allowed_domains = [allowed_domain],
    url_id = url_id,
    cur_state = cur_state,
    state_id_url_map = id_url,
    allow = re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
    tags = ('a', 'area', 'frame'),
    attrs = ('href', 'src'),
    response_type_whitelist = [r"text/html", r"application/xhtml+xml", r"application/xml"],
    state_abbr = state_abbrs[cur_state]
)
I think the problem is that the allowed_domains sees that / is not part of the list (which contains only cantonsd.org) and shuts down everything.
I'm not reporting the full spider code because it is not invoked at all, so it can't be the problem.
I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage; there are some category-list links that will be used to build the second wave of links.
The second round of links usually points to the first page of each category. Different pages inside a category follow the same regular expression patterns, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and, meanwhile, store the raw HTML in my item object.
I tested these two steps separately using the parse command, and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested the built outlink again.
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together using the depth argument, I saw that it crawled the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
# (imports added for completeness; adjust the item import path to your project)
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import MyprojectItem


class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would be run against the rules and then parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I saw that the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I built match the second rule, they won't be passed to parse_pricing unless the callback is specified explicitly. I hope this is helpful to other people.