Scrapy Spider authenticate and iterate - python

Below is a Scrapy spider I have put together to pull some elements from a web page. I borrowed this approach from another Stack Overflow answer. It works, but I need more. I need to be able to walk the series of pages specified in the for loop inside the start_requests method after authenticating.
Yes, I did locate the Scrapy documentation discussing this along with a previous solution for something very similar. Neither one seems to make much sense. From what I can gather, I need to somehow create a request object and keep passing it along, but cannot seem to figure out how to do this.
Thank you in advance for your help.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

class MyBasicSpider(BaseSpider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override BaseSpider.start_requests to crawl all reaches in series
        '''
        # for every integer from 1 to 5000
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four-digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield self.make_requests_from_url('https://mycrawlsite.com/{0}/'.format(iStr))

    def parse(self, response):
        # create xpath selector object instance with response
        hxs = HtmlXPathSelector(response)
        # get the four-digit id from the url string
        url = response.url
        id = re.findall(r'/(\d{4})/', url)[0]
        # selector 01
        attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
        # selector for river section
        attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))

You may have to approach the problem from a different angle:
first of all, scrape the main page; it contains a login form, so you can use FormRequest to simulate a user login; your parse method will likely look something like this:
def parse(self, response):
    return [FormRequest.from_response(response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login)]
in after_login you check if the authentication was successful, usually by scanning the response for error messages; if all went well and you're logged in, you can start generating requests for the pages you're after:
def after_login(self, response):
    if "Login failed" in response.body:
        self.log("Login failed", level=log.ERROR)
    else:
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield Request(url='https://mycrawlsite.com/{0}/'.format(iStr),
                          callback=self.scrape_page)
scrape_page will be called with each of the pages you created a request for; there you can finally extract the information you need using XPath, regex, etc.
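A minimal sketch of what scrape_page might look like; the XPath below is a placeholder, not a selector taken from the real site:
def scrape_page(self, response):
    hxs = HtmlXPathSelector(response)
    # placeholder selector -- point it at whatever element you actually need
    title = hxs.select('//h1/text()').extract()[0]
    self.log('scraped {0}: {1}'.format(response.url, title))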
BTW, you shouldn't 0-pad numbers manually; format will do it for you if you use the right format specifier.
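For example, the {0:04d} format specifier pads an integer to four digits with leading zeros, so the whole padding loop collapses to:
for i in xrange(1, 50):  # 1 to 50 for testing
    # '{0:04d}' -> 0001, 0002, ... 0049
    yield Request(url='https://mycrawlsite.com/{0:04d}/'.format(i),
                  callback=self.scrape_page)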

Related

extract_first only resulting in first item only .extract() does not work

I hope that you're all well. I am trying to learn Python through web-scraping. My project at the moment is to scrape data from a games store. I am initially wanting to follow a product link and print the response from each link. There are 60 game links on the page that I wish for Scrapy to follow. Below is the code.
import scrapy

class GameSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['365games.co.uk']
    start_urls = ['https://www.365games.co.uk/3ds-games/']

    def parse(self, response):
        all_games = response.xpath('//*[@id="product_grid"]')
        for game in all_games:
            game_url = game.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(game_url, callback=self.parse_game)

    def parse_game(self, response):
        print(response.status)
When I run this code, Scrapy goes through the first link and prints the response, but then stops. When I change the code to .extract() I get the following:
TypeError: Request url must be str or unicode, got list
The same applies with .get()/.getall(): .get() returns only the first item and .getall() raises the error above.
Any help would be greatly appreciated, but please be gentle I am trying to learn.
Thanks in advance and best regards,
Gav
The error is saying that you are passing a list instead of a string to scrapy.Request. This tells us that game_url is actually a list, when you want a string. You are very close to the right thing here, but I believe your problem is that you are looping in the wrong place. Your first XPath returns just a single item, rather than a list of items. It is within this that you want to find your game_urls, leading to
def parse(self, response):
    product_grid = response.xpath('//*[@id="product_grid"]')
    for game_url in product_grid.xpath('.//h3/a/@href').getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
You could also combine your XPath queries directly into one:
def parse(self, response):
    all_games = response.xpath('//*[@id="product_grid"]//h3/a/@href')
    for game_url in all_games.getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
In this case you could also use follow instead of creating a new Request directly. You can even pass a selector rather than a string to follow, so you don't need getall(), and it knows how to deal with <a> elements so you don't need the @href either!
def parse(self, response):
    for game_url in response.xpath('//*[@id="product_grid"]//h3/a'):
        yield response.follow(game_url, callback=self.parse_game)

How to crawl all webpages on website up to certain depth?

I have a website and I would like to find a webpage with information about job vacancies. There is usually only one page with such information. So I start crawling from the website and I manage to get all webpages up to a certain depth. It works, but the pages are duplicated many times: instead of, let's say, 45 pages I get 1000 pages. I know the reason why. Every time I call my "parse" function, it parses all the links on a given webpage. So when I come to a new webpage, it crawls all the links, some of which have been crawled before.
1) I tried to move the "items = []" list out of the parse function, but I get some global-variable error. I don't know how to get a list of unique webpages. Once I have one, I will be able to choose the right one with simple URL parsing.
2) I also tried to have "Request" and "return items" in the "parse" function, but I get a syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime

class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    def parse(self, response):
        response.selector.remove_namespaces()
        urls = response.xpath('//@href').extract()  # choose all "href" values, both external sites and webpages on our website
        items = []
        base_url = get_base_url(response)  # base url
        for url in urls:
            # we need only webpages, so we remove all external sites and urls with strange characters
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                item = JobItem()
                absolute_url = urlparse.urljoin(base_url, url)
                item["link"] = absolute_url
                if item not in items:
                    items.append(item)
                    yield item
                    yield Request(absolute_url, callback=self.parse)
        # return items
#return items
You're appending item (a newly instantiated object) to your list items. Since item is always a new JobItem() object, it will never already exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute that is the same, doesn't mean they are the same object.
Even if this worked, though, you're resetting the list items every time you call parse (i.e. for each request), so you'll never really remove duplicates.
Instead, you would be better off checking against the absolute_url itself, and keeping the list at the spider level:
class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    all_urls = []

    def parse(self, response):
        # remove "items = []"
        ...
        for url in urls:
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                absolute_url = urlparse.urljoin(base_url, url)
                if absolute_url not in self.all_urls:
                    self.all_urls.append(absolute_url)
                    item = JobItem()
                    item['link'] = absolute_url
                    yield item
                    yield Request(absolute_url, callback=self.parse)
This functionality, however, would be better served by creating a Dupefilter instead (see here for more information). Additionally, I agree with @RodrigoNey, a CrawlSpider would likely better serve your purpose, and be more maintainable in the long run.
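For illustration, a minimal sketch of a URL-based dupefilter, assuming a Scrapy version where dupefilters live in scrapy.dupefilters; the myproject.dupefilters module path in the settings comment is hypothetical:
# settings.py (hypothetical module path)
# DUPEFILTER_CLASS = 'myproject.dupefilters.SeenURLFilter'

from scrapy.dupefilters import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """Drop any request whose exact URL has already been scheduled."""

    def __init__(self, path=None, debug=False):
        self.urls_seen = set()
        super(SeenURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True  # tell the scheduler to drop the duplicate
        self.urls_seen.add(request.url)
        return False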
I'm working on a web crawler and ended up keeping a list of links that still needed to be crawled; once we visited a link, it was deleted from that list and added to the crawled list. Then you can use a "not in" check to add/delete/etc., as in the sketch below.
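A rough, non-Scrapy illustration of that bookkeeping (the extract_links() helper is a placeholder, not a real function):
to_crawl = set(['http://www.gen-i.si'])  # links discovered but not yet visited
crawled = set()                          # links already visited

while to_crawl:
    url = to_crawl.pop()
    crawled.add(url)
    for link in extract_links(url):      # extract_links() stands in for your own link extraction
        if link not in crawled:
            to_crawl.add(link)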

Scrapy Linkextractor duplicating(?)

I have the crawler implemented as below.
It works and goes through the pages allowed by the link extractor.
Basically what I am trying to do is to extract information from different places on the page:
- href and text() under the class 'news' (if it exists)
- image url under the class 'think block' (if it exists)
I have three problems with my Scrapy spider:
1) duplicating linkextractor
It seems that the same processed page gets duplicated. (I checked the export file and found that the same ~.img appeared many times, which is hardly possible.)
The fact is, every page on the website has hyperlinks at the bottom that let users jump to a topic they are interested in, while my objective is to extract information from each topic's page (which lists the titles of several passages under the same topic) and the images found within a passage's page (you reach a passage's page by clicking on the passage's title on the topic page).
I suspect the link extractor loops over the same page again in this case.
(Maybe solve with DEPTH_LIMIT?)
2) Improving parse_item
I think parse_item is quite inefficient. How could I improve it? I need to extract information from different places on the page (of course it only extracts if it exists). Besides, it looks like parse_item can only produce HkejImage items but not HkejItem items (again, I checked with the output file). How should I tackle this?
3) I need the spiders to be able to read Chinese.
I am crawling a site in HK and it is essential to be able to read Chinese.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
As long as it belongs to 'dailynews', that's the thing I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items

class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls = 'http://www.hkej.com/dailynews'
    rules = (
        Rule(LinkExtractor(allow=('dailynews', ), unique=True), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
                    formdata={'name': 'users', 'password': 'my password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news = hxs.xpath("//div[@class='news']")
        images = hxs.xpath('//p')
        for image in images:
            allimages = items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages
        for new in news:
            allnews = items.HKejItem()
            allnews['news_title'] = new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews
Thank you very much and I would appreciate any help!
First, to set settings, put them in the settings.py file, or specify the custom_settings attribute on the spider, like:
custom_settings = {
    'DEPTH_LIMIT': 3,
}
Then, you have to make sure the spider is reaching the parse_item method (which I think it isn't; I haven't tested it yet). Also, you can't specify both the callback and follow parameters on a rule, because they don't work together.
First remove the follow on your rule, or add another rule, to check which links to follow, and which links to return as items.
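For instance, splitting that into two rules might look like the sketch below; the second allow pattern is only a guess based on the article URL shown in the question:
rules = (
    # follow listing/topic links under 'dailynews' without turning them into items
    Rule(LinkExtractor(allow=('dailynews', )), follow=True),
    # parse article pages as items (pattern guessed from the example URL)
    Rule(LinkExtractor(allow=('dailynews/headline/article', )), callback='parse_item'),
)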
Second, in your parse_item method you are using incorrect XPaths. To get all the images, maybe you could use something like:
images=hxs.xpath('//img')
and then to get the image url:
allimages['image'] = image.xpath('./#src').extract()
for the news, it looks like this could work:
allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/#href').extract()
Now, as I understand your problem, this isn't a LinkExtractor duplication error, but only a poor rules specification. Also make sure you have valid XPaths, because your question didn't indicate you needed XPath correction.

Issue with loop in scrapy+selenium+phantomjs

I've been trying to build a small scraper for ebay (college assignment). I already figured out most of it, but I ran into an issue with my loop.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        service_args = ['--load-images=no', ]
        self.driver = webdriver.PhantomJS(executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd', service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            jackson = self.driver.execute_script("return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson == True:
                item['title'] = self.driver.execute_script("return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break
The urls (start_urls are dispatched from txt file):
http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7
I'm running scrapy version 0.24.6 and phantomjs version 2.0. The objective is to go to the urls and extract the variations or attributes from the ebay form.
The if statement right at the beginning of the loop checks whether the element exists, because Selenium returns a bad header error if it can't find an element. I also yield the item inside the loop because I need each variation on a new row. I use execute_script because it is 100 times faster than using Selenium's get element by XPath.
The main problem I run into is the way Scrapy returns my item results: if I use one URL as my start_url it works like it should (it returns all items in a neat order). The second I add more URLs, I get a completely different result: all my items are scrambled, some items are returned multiple times, and the result also varies almost every time. After countless tests I noticed that yield item is causing some kind of problem; when I removed it and just printed the results, they came back perfectly. I really do need each item on a new row though, and the only way I have found to do that is by using yield item (maybe there's a better way?).
As of now I have just copy-pasted the looped code, changing the XPath option manually. It works as expected, but I really need to be able to loop through items in the future. If someone sees an error in my code or a better way to do this, please tell me. All responses are helpful...
Thanks
If I correctly understood what you want to do, I think this one could help you.
Scrapy Crawl URLs in Order
The problem is that start_urls are not processed in order. They are passed to the start_requests method, and the downloaded responses are handed to the parse method as they arrive. This is asynchronous.
Maybe this helps
# Do your thing
start_urls = [open('bobo.txt').readlines()[0].strip()]
other_urls = [l.strip() for l in open('bobo.txt').readlines()[1:]]
other_urls.reverse()

# Do your thing
def parse(self, response):
    # Do your thing
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).
And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths='//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, that gives you listings of items, and also, very important, add follow=True to that rule.
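For reference, a sketch of the remaining rule with follow=True added, reusing the patterns from the question:
rules = (
    # keep only the listings rule; follow=True keeps the spider paginating
    # while parse_item handles each listing page
    Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                           restrict_xpaths='//div[@class="pagination"]'),
         callback='parse_item', follow=True),
)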
And in parse_item() you have to yield a request instead of an item, but after you load the item. The request is to item detail url. And you have to send the loaded item to that request callback. You do your job with the response, and there is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract that information myself, and save it as part of the item. Yes, it's much more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
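A rough sketch of that manual fallback, assuming Python 2 (to match the era of this thread) and the bs4 package; sel, item_url_xpath and item come from the elided loop body in the question's parse_item, and the get_text() field choice is just an example:
import urllib2
from bs4 import BeautifulSoup

def parse_item(self, response):
    # ... build and load the item exactly as before, then fetch the detail page by hand
    detail_url = sel.select(item_url_xpath).extract()[0]  # same selector idea as in the answer above
    html = urllib2.urlopen(detail_url).read()             # blocking fetch, bypasses Scrapy's scheduler
    soup = BeautifulSoup(html)
    item['url_contents'] = soup.get_text()                # or keep the raw html, whatever you need
    yield item
Note that urlopen blocks inside the callback, which is one reason the Request + meta approach in the accepted answer is usually preferable.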
