I have a website and I would like to find the webpage with information about job vacancies. There is usually only one such page. So I start crawling from the site and I manage to get all webpages up to a certain depth. It works, but the pages are duplicated many times: instead of, let's say, 45 pages I get 1000 pages. I know the reason why. Every time I call my "parse" function, it parses all the links on a given webpage, so when I come to a new webpage, it crawls all of its links, some of which have been crawled before.
1) I tried to move the "items=[]" list out of the parse function, but then I get an error about globals. I don't know how to get a list of unique webpages. Once I have one, I will be able to choose the right page with simple URL parsing.
2) I also tried to have both "Request" and "return items" in the "parse" function, but I get a syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime

class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    def parse(self, response):
        response.selector.remove_namespaces()  # drop XML namespaces so plain XPath works
        urls = response.xpath('//@href').extract()  # collect every "href": external sites as well as pages on our own site
        items = []
        base_url = get_base_url(response)  # base URL of the current page
        for url in urls:
            # keep only internal pages: skip external sites and URLs containing special characters
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                item = JobItem()
                absolute_url = urlparse.urljoin(base_url, url)
                item["link"] = absolute_url
                if item not in items:
                    items.append(item)
                    yield item
                    yield Request(absolute_url, callback=self.parse)
        #return items
You're appending item (a newly instantiated object) to your list items. Since item is always a new JobItem() object, it will never already exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute in common doesn't mean they are the same object.
Even if this worked, you're resetting the list items every time you call parse (i.e. for each request), so you'll never actually remove duplicates.
Instead, you would be better off checking against the absolute_url itself and keeping the list at the spider level:
class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    all_urls = []

    def parse(self, response):
        # remove "items = []"
        ...
        for url in urls:
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                absolute_url = urlparse.urljoin(base_url, url)
                if absolute_url not in self.all_urls:
                    self.all_urls.append(absolute_url)
                    item = JobItem()
                    item['link'] = absolute_url
                    yield item
                    yield Request(absolute_url, callback=self.parse)
This functionality, however, would be better served by creating a Dupefilter instead (see the Scrapy documentation for more information). Additionally, I agree with @RodrigoNey: a CrawlSpider would likely serve your purpose better and be more maintainable in the long run.
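For reference, here is a minimal CrawlSpider sketch along those lines; the rule, spider name and callback name are assumptions for illustration, not taken from your project (only JobItem and the start URL come from the question). With this setup Scrapy's built-in duplicate filter drops URLs that have already been requested:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem

class JobCrawlSpider(CrawlSpider):
    name = "jobs_crawl"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    rules = (
        # follow internal links and call parse_page for every page found;
        # the default dupefilter skips URLs that were already requested
        Rule(LinkExtractor(allow_domains=["www.gen-i.si"]),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        item = JobItem()
        item["link"] = response.url
        yield item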
I'm working on a web crawler and ended up keeping a list of links that still needed to be crawled; once a page was visited, it was deleted from that list and added to a "crawled" list. Then you can use a "not in" check to decide whether to add a link, skip it, and so on, as in the small sketch below.
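As a tiny illustration of that bookkeeping (extract_links is a hypothetical helper, not part of Scrapy):
to_crawl = {"http://www.gen-i.si/"}
crawled = set()

while to_crawl:
    url = to_crawl.pop()
    crawled.add(url)                          # mark as visited
    for link in extract_links(url):           # hypothetical helper returning the page's links
        if link not in crawled and link not in to_crawl:
            to_crawl.add(link)                # only schedule links we haven't seen yet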
Related
I use Scrapy to scrape data from the first URL.
The first URL returns a response that contains a list of URLs.
So far, so good. My question is: how can I scrape this list of URLs further? After searching, I learned that I can return a request from parse, but it seems I can only process one URL that way.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue with b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
For this purpose, you need to subclass scrapy.Spider and define a list of URLs to start with. Then Scrapy will automatically follow the links it finds.
Just do something like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information on the official documentation of Scrapy.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)
This should work
Below is a Scrapy spider I have put together to pull some elements from a web page. I borrowed it from another Stack Overflow answer. It works, but I need more: I need to be able to walk the series of pages specified in the for loop inside the start_requests method after authenticating.
Yes, I did locate the Scrapy documentation discussing this, along with a previous answer for something very similar, but neither one makes much sense to me. From what I can gather, I need to somehow create a request object and keep passing it along, but I cannot figure out how to do this.
Thank you in advance for your help.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

class MyBasicSpider(BaseSpider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override BaseSpider.start_requests to crawl all reaches in series
        '''
        # for every integer from one to 5000
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield self.make_requests_from_url('https://mycrawlsite.com/{0}/'.format(iStr))

    def parse(self, response):
        # create xpath selector object instance with response
        hxs = HtmlXPathSelector(response)
        # get the four-digit id out of the url string
        url = response.url
        id = re.findall('/(\d{4})/', url)[0]
        # selector 01
        attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
        # selector for river section
        attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))
You may have to approach the problem from a different angle:
First of all, scrape the main page; it contains a login form, so you can use FormRequest to simulate a user login. Your parse method will likely look something like this:
def parse(self, response):
    return [FormRequest.from_response(response,
                                      formdata={'username': 'john', 'password': 'secret'},
                                      callback=self.after_login)]
In after_login you check whether the authentication was successful, usually by scanning the response for error messages. If all went well and you're logged in, you can start generating requests for the pages you're after:
def after_login(self, response):
    if "Login failed" in response.body:
        self.log("Login failed", level=log.ERROR)
    else:
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield Request(url='https://mycrawlsite.com/{0}/'.format(iStr),
                          callback=self.scrape_page)
scrape_page will be called with each of the pages you created a request for; there you can finally extract the information you need using XPath, regex, etc.
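A rough sketch of what scrape_page could look like, reusing the old-style selectors from the question; the XPaths and field names here are the question's placeholders, not the real ones:
def scrape_page(self, response):
    hxs = HtmlXPathSelector(response)
    # pull the four-digit page id back out of the URL
    page_id = re.findall('/(\d{4})/', response.url)[0]
    # placeholder selectors, to be replaced with the fields you actually need
    attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
    attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
    self.log('ID: {0}  Attr01: {1}  Attr02: {2}'.format(page_id, attribute01, attribute02))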
BTW, you shouldn't 0-pad numbers manually; format will do it for you if you use the right format specifier.
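For example, a one-line change to the loop above does the padding for you:
# '{0:04d}' zero-pads the integer to four digits, replacing the manual while loop
for i in xrange(1, 50):  # 1 to 50 for testing
    yield Request(url='https://mycrawlsite.com/{0:04d}/'.format(i),
                  callback=self.scrape_page)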
I am getting confused about how to design the architecture of my crawler.
I have a search page with:
pagination: next-page links to follow
a list of products on one page
individual links to be crawled to get each product's description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I'm mixing two ways of producing output: a return for the request, and a yield.
In a CrawlSpider I wouldn't need the last yield, so everything was working fine, but with BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
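For the immediate return-versus-yield issue, here is a sketch of the same parse_page written as a pure generator, based only on the code in the question (no return statements, every request is yielded):
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            yield request          # yield instead of return

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    if mylinks:                    # guard added for the sketch: the last page has no "Next" link
        nextlink = mylinks[0].get('href')
        yield Request(urljoin(response.url, nextlink), callback=self.parse_page)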
I have written a Scrapy CrawlSpider to crawl a site with a structure like category page > type page > list page > item page. On the category page there are many categories of machines, each of which has a type page with lots of types; each of the different types has a list of items, and finally each machine has a page with info about it.
My spider has a rule to get from the home page to the category page, where I define the callback parsecatpage; this generates an item, grabs the category, and yields a new request for each category on the page. I pass the item and the category name with request.meta and specify that the callback is parsetypepage.
Parsetypepage gets the item from response.meta, then yields requests for each type, passing the item and the concatenation of category and type along with it in request.meta. The callback is parsemachinelist.
Parsemachinelist gets the item from response.meta, then yields requests for each item on the list and passes the item, category/type and description via request.meta to the final callback, parsemachine. This gets the meta attributes, populates all the fields of the item using the info on the page plus the info passed from the previous pages, and finally yields an item.
If I limit this to a single category and type (with, for example, contains(@href, "filter=c:Grinders") and contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")) then it works, and there is a machine item for each machine on the final page. The problem is that once I allow the spider to scrape all the categories and all the types, it only returns items for the machines on the first of the final pages it reaches; once it has done that, the spider finishes and doesn't get the other categories etc.
Here is the (anonymised) code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse

class MachineSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/index.php']

    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('example.com'), allow=('12\.html'), unique=True), callback='parsecatpage'),
    )

    def parsecatpage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works: categories = hxs.select('//a[contains(@href, "filter=c:Grinders")]')
        # the next line doesn't:
        categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
        for cat in categories:
            item = MachineItem()
            req = Request(urlparse.urljoin(response.url, ''.join(cat.select("@href").extract()).strip()), callback=self.parsetypepage)
            req.meta['item'] = item
            req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
            yield req

    def parsetypepage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works: types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
        # the next line doesn't:
        types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(@href, "filter=t:Lathe%2C+Production")]')
        for typ in types:
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(typ.select("@href").extract()).strip()), callback=self.parsemachinelist)
            req.meta['item'] = item
            req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'], ''.join(typ.select("./text()").extract())])
            yield req

    def parsemachinelist(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tr[contains(td/a/@href, "action=searchdet")]'):
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip()), callback=self.parsemachine)
            print urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip())
            req.meta['item'] = item
            req.meta['descr'] = row.select('./td/div/text()').extract()
            req.meta['machinecategory'] = response.meta['machinecategory']
            yield req

    def parsemachine(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['machinecategory'] = response.meta['machinecategory']
        item['comp_name'] = 'Name'
        item['description'] = response.meta['descr']
        item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()), ''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
        item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
        relative_image_url = hxs.select('//img[contains(@src, "custom/modules/images")]/@src')[0].extract()
        abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
        item['image_urls'] = [abs_image_url]
        yield item

SPIDER = MachineSpider()
So, for example, the spider will find Grinders on the category page and go to the Grinder type page, where it will find the Disc Horizontal Single End type. Then it will go to that page, find the list of machines, go to each machine's page, and finally there will be an item for each machine. If you try to crawl Grinders and Lathes, though, it will run through the Grinders fine, then crawl the Lathes and Lathes type pages and stop there, without generating the requests for the Lathes list page and the final Lathes pages.
Can anyone help with this? Why isn't the spider getting to the second (or third, etc.) machine list page once there is more than one category of machine?
Sorry for the epic post, just trying to explain the problem!!
Thanks!!
You should print the URL of the request to be sure it's correct. You can also try this version:
def parsecatpage(self, response):
    hxs = HtmlXPathSelector(response)
    categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
    for cat in categories:
        item = MachineItem()
        cat_url = urlparse.urljoin(response.url, cat.select("./@href").extract()[0])
        print 'url:', cat_url  # to see what's there
        cat_name = cat.select("./text()").extract()[0]
        req = Request(cat_url, callback=self.parsetypepage, meta={'item': item, 'machinecategory': cat_name})
        yield req
The problem was that the website is set up so that moving from the category page to the type page (and the pages that follow) works by filtering the results that are shown. This means that if the requests are performed depth-first, down to the bottom of one query, everything works (i.e. choose a category, get all the types of that category, get all the machines of each type, then scrape the page of each machine). But if a request for the next type page is processed before the spider has collected the URLs for each machine of the first type, the URLs are no longer correct, and the spider reaches an incorrect page and cannot extract the info for the next step.
To solve the problem I defined a category-setup callback, which is called only the first time and collects a list of all categories, called categories. It then hands off to a category callback, which starts the crawl with a single category obtained via categories.pop(). Once the spider has reached the bottom of the nested callbacks and scraped all the machines in the list, there is a callback back up to the category callback (which needed dont_filter=True on the Request), where categories.pop() starts the process again with the next category in the list, until they are all done. This way each category is handled fully before the next one starts, and it works.
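A rough sketch of that callback structure; every name, URL and XPath here is made up for illustration rather than taken from the real spider:
def parsecatsetup(self, response):
    # called only once: collect every category link up front
    hxs = HtmlXPathSelector(response)
    self.categories = hxs.select('//a[contains(@href, "filter=c:")]/@href').extract()
    self.category_page = response.url
    return self.parsecategory(response)

def parsecategory(self, response):
    # start exactly one category at a time
    if self.categories:
        cat_url = urlparse.urljoin(self.category_page, self.categories.pop())
        yield Request(cat_url, callback=self.parsetypepage)

# ...and at the very bottom of the nested callbacks, after the last machine of the
# current category has been yielded, hand control back to parsecategory again.
# dont_filter=True is needed because the category page was already visited:
#     yield Request(self.category_page, callback=self.parsecategory, dont_filter=True)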
Thanks for your final comment, that got me thinking along the right lines and led me to the solution!
I have a question about how to do this in Scrapy. I have a spider that crawls listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield the items. So far so good; everything works great.
But each item has, among other data, a URL with more details about that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).
I'm not sure how to organise the code to achieve this, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the processing of a single item.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths='//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, which gives you the listings of items, and, very importantly, to add follow=True to that rule.
In parse_item() you have to yield a request instead of an item, but only after you have loaded the item. The request is for the item detail URL, and you have to send the loaded item along with that request to its callback. You do your job with the response, and that is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
item = response.request.meta['item']
item['url_contents'] = response.body
yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no one has answered your question for two days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with Scrapy itself (although I'm sure it's possible); my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's much more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
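A rough sketch of that manual approach (Python 2 to match the rest of the thread; the selector, loader and field names are borrowed from the question or made up, and the detail-link XPath is an assumption):
# somewhere at the top of the spider module:
import urllib
from bs4 import BeautifulSoup  # or "from BeautifulSoup import BeautifulSoup" on very old setups

def parse_item(self, response):
    main_selector = HtmlXPathSelector(response)
    for sel in main_selector.select('//h2[@class="title"]'):
        l = ExampleLoader(item=ExampleItem(), selector=sel)
        l.add_xpath('title', 'a[@title]/@title')
        item = l.load_item()
        # fetch the detail page synchronously, outside Scrapy's scheduler
        detail_url = sel.select('a/@href').extract()[0]
        html = urllib.urlopen(detail_url).read()
        item['url_contents'] = BeautifulSoup(html).get_text()  # or keep the raw html, as needed
        yield item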