Here is the code I'm using to scrape all the URLs of a domain:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True))

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print link.url
            yield scrapy.Request(link.url, callback=self.parse)
As you can see, I've used unique=True, but it still prints duplicate URLs in the terminal, whereas I want only unique URLs, not duplicates.
Any help on this matter would be much appreciated.
Since the code follows the extracted URLs recursively, you will see the same URLs again when they are extracted from other pages. unique=True only removes duplicates within a single response, and you create a new LxmlLinkExtractor() instance on every call to parse, so each page is deduplicated in isolation. (Scrapy's scheduler still filters out duplicate requests by default; it is only the print that shows the repeats.)
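If you want the spider itself to skip URLs it has already printed, one option (not part of the original answer) is to keep a spider-level set of seen URLs. A minimal sketch, assuming a Scrapy version where LinkExtractor can be imported from scrapy.linkextractors; the class name and the seen attribute are mine:

import scrapy
from scrapy.linkextractors import LinkExtractor

class UniqueUrlsSpider(scrapy.Spider):
    name = 'uniqueurlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super(UniqueUrlsSpider, self).__init__(*args, **kwargs)
        self.seen = set()  # URLs already printed, shared across all pages

    def parse(self, response):
        for link in LinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
            if link.url not in self.seen:
                self.seen.add(link.url)
                print(link.url)
                # Scrapy's scheduler still filters duplicate requests by default
                yield scrapy.Request(link.url, callback=self.parse)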
I'm new to Scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]

    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point, but nothing I've tried so far works.
It finishes with an exit code of 0 but doesn't print anything, so I can't tell what's happening.
What am I missing?
Your log messages should give us some hints, but I can see that your allowed_domains contains a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
Hope it helps.
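For completeness, a minimal sketch of the spider with the corrected allowed_domains; the logging and yield lines are my own additions, the rest follows the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CommentSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['andnowuknow.com']  # a domain, not a URL
    start_urls = ['https://www.andnowuknow.com/']

    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        for link in LinkExtractor().extract_links(response):
            self.logger.info('Internal link: %s', link.url)
            yield {'url': link.url}  # yield items instead of only printing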
I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL that matches the pattern /example/.*
The crawling works, but the title scraping does not: the output file is empty. Most likely I got the rules wrong. Is this the right rule syntax for what I am trying to achieve?
items.py
import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href
crawl: scrapy crawl getbid -o 012916.csv
From the CrawlSpider docs:
If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
Since your first rule will match all links, it will always be used and all other rules will be ignored.
Fixing the problem is as simple as switching the order of the rules.
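A minimal sketch of the reordered rules (only the rules attribute changes; the follow=True on the first rule is my addition so that matching pages are also crawled further):

rules = (
    # Checked first: URLs matching example/.* get scraped by parse_item
    Rule(
        LinkExtractor(allow=['example/.*']),
        callback='parse_item',
        follow=True,
    ),
    # Fallback for everything else: just follow the links
    Rule(
        LinkExtractor(),
        follow=True,
    ),
)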
Problem: Scrapy keeps visiting a single URL and keeps scraping it recursively. I have checked response.url to make sure it really is a single page that keeps getting scraped, and there is no query string involved that might render the same page under different URLs.
What I have done to resolve it:
Under Scrapy/spider.py I noticed that dont_filter was set to True and changed it to False, but it didn't help.
I have also set unique=True in the code, but this didn't help either.
Additional information
The page given as start_url has only one link, to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/aj'), unique=('Yes')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('/html/head/meta[3]').extract()
            item['req_url'] = response.url
            items.append(item)
        return items
Scrapy, by default, appends to the output file if it already exists. What you see in output.csv is the accumulated result of multiple spider runs. Remove output.csv before running the spider again.
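So, assuming your feed file is called output.csv (use whatever name you actually pass to -o), something along these lines:

rm output.csv                      # del output.csv on Windows
scrapy crawl dmoz -o output.csv    # -o appends; newer Scrapy versions also offer -O to overwrite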
I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine, but I'm having trouble getting it to output to a CSV file (or anything, really).
So my question is: can I use this:
scrapy crawl dmoz -o items.csv
or do I have to create an Item Pipeline?
UPDATED, now with code!:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem

class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()  # this pulled down data in scrapy shell
        return item
Rules are the mechanism CrawlSpider uses for following links. Those links are defined with a LinkExtractor. This element basically indicates which links to extract from the crawled page (like the ones defined in the start_urls list) to be followed. Then you can pass a callback that will be called on each extracted link, or more precisely, on the pages downloaded by following those links.
Your rule must call parse_item. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),
This rule says that you want to call parse_item on every link whose URL matches ididntuseexample.com. I suspect that what you actually want in the link extractor is not the domain, but the pattern of the links you want to follow/scrape.
Here is a basic example that crawls Hacker News, retrieving the title and the first lines of the first comment for every story on the front page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item
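And to answer the original question directly: once a callback is wired up like this, the built-in feed export should be enough, so no custom item pipeline is needed:

scrapy crawl hackernews -o items.csv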
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
This is my code. I want to scrape plenty of URLs using a loop, so how am I supposed to do that? I did put multiple URLs in there, but I didn't get output from all of them: some URLs stop responding. How can I reliably get the data with this code?
Your code looks OK, but are you sure the start_urls shouldn't start with http://?
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
UPD
start_urls is the list of URLs Scrapy starts with. Usually it has one or two links, rarely more.
These pages must have an identical HTML structure, because the Scrapy spider processes them all in the same way.
See, if I put 4-5 URLs in start_urls, it gives output OK for the first 2-3 URLs.
I don't believe this, because Scrapy doesn't care how many links are in the start_urls list.
But it stops responding. Also, tell me how I can implement a GUI for this?
Scrapy has a debug shell for testing your code.
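If some of the start URLs simply stop responding, one option (my suggestion, not part of the original answer, and written against a newer Scrapy API than the question's code) is to add an errback plus timeout/retry settings, so failed pages are retried and at least logged instead of silently lost:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 30,  # give up on a page after 30 seconds
        "RETRY_TIMES": 3,        # retry failed downloads a few more times
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        for site in response.xpath('//ul/li'):
            yield {
                'title': site.xpath('a/text()').extract(),
                'link': site.xpath('a/@href').extract(),
                'desc': site.xpath('text()').extract(),
            }

    def on_error(self, failure):
        # Requests that never got a response end up here instead of vanishing
        self.logger.error("Request failed: %s", failure.request.url)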
You just posted the code from the tutorial. What you should do is actually read the whole documentation, especially the basic concepts part. What you basically want is the CrawlSpider, where you can define rules that the spider follows and processes with your given code.
To quote the docs with the example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item