I'm trying to scrape with phantomJS and selenium (and scrapy) into a list of links inside a website. I'm new to PhantomJS and Selenium so I'll ask here.
I think the website has a session timeout because I can scrape just the first of those links. Then I get this error:
NoSuchWindowException: Message: {"errorMessage":"Currently Window
handle/name is invalid
(closed?)","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"460","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:33038","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\"url\":
That's part of my code:
class bllanguage(scrapy.Spider):
handle_httpstatus_list = [302]
name = "bllanguage"
download_delay = 1
allowed_domains = ["http://explore.com/"]
f = open("link")
start_urls = [url.strip() for url in f.readlines()]
f.close()
def __init__(self):
self.driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
def start_requests(self):
for u in self.start_urls:
r = scrapy.Request(url = u, dont_filter=True, callback=self.parse)
r.meta['dont_redirect'] = True
yield r
def parse(self, response):
self.driver.get(response.url)
#print response.url
search_field = []
Etc.
The session timeout problem is just my interpretation, I've seen other messages like that but none of them has a solution. What I would like to try is to "close" the request to each link inside the "link" file. I don't know if this is something PhantomJS does naturally or I have to insert something: I've seen there's a resourceTimeout setting. Is it the right thing to use and where can I put it inside my code?
Related
So I'm trying to write a spider to continue clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. My question is why isn't it "following" the links that each next button leads to?
class MyprojectSpider(scrapy.Spider):
name = 'redditbot'
allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
def parse(self, response):
hxs = HtmlXPathSelector(response)
next_page = hxs.select('//div[#class="nav-buttons"]//a/#href').extract()
if next_page:
yield Request(next_page[1], self.parse)
print(next_page[1])
To go to the next page, instead of printing the link you just need to yield a scrapy.Request object like the following code:
import scrapy
class MyprojectSpider(scrapy.Spider):
name = 'myproject'
allowed_domains = ['reddit.com']
start_urls = ['https://www.reddit.com/r/nfl/']
def parse(self, response):
posts = response.xpath('//div[#class="top-matter"]')
for post in posts:
# Get your data here
title = post.xpath('p[#class="title"]/a/text()').extract()
print(title)
# Go to next page
next_page = response.xpath('//span[#class="next-button"]/a/#href').extract_first()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: Previous code was wrong, needed to use the absolute URL and also some Xpaths were wrong, this new one should work.
Hope it helps!
I am building a web scraper that downloads csv files from a website. I have to login to multiple user accounts in order to download all the files. I also have to navigate through several hrefs to reach these files for each user account. I've decided to use Scrapy spiders in order to complete this task. Here's the code I have so far:
I store the username and password info in a dictionary
def start_requests(self):
yield scrapy.Request(url = "https://external.lacare.org/provportal/", callback = self.login)
def login(self, response):
for uname, upass in login_info.items():
yield scrapy.FormRequest.from_response(
response,
formdata = {'username': uname,
'password': upass,
},
dont_filter = True,
callback = self.after_login
)
I then navigate through the web pages by finding all href links in each response.
def after_login(self, response):
hxs = scrapy.Selector(response)
all_links = hxs.xpath('*//a/#href').extract()
for link in all_links:
if 'listReports' in link:
url_join = response.urljoin(link)
return scrapy.Request(
url = url_join,
dont_filter = True,
callback = self.reports
)
return
def reports(self, response):
hxs = scrapy.Selector(response)
all_links = hxs.xpath('*//a/#href').extract()
for link in all_links:
url_join = response.urljoin(link)
yield scrapy.Request(
url = url_join,
dont_filter = True,
callback = self.select_year
)
return
I then crawl through each href on the page and check the response to see if I can keep going. This portion of the code seems excessive to me, but I am not sure how else to approach it.
def select_year(self, response):
if '>2017' in str(response.body):
hxs = scrapy.Selector(response)
all_links = hxs.xpath('*//a/#href').extract()
for link in all_links:
url_join = response.urljoin(link)
yield scrapy.Request(
url = url_join,
dont_filter = True,
callback = self.select_elist
)
return
def select_elist(self, response):
if '>Elists' in str(response.body):
hxs = scrapy.Selector(response)
all_links = hxs.xpath('*//a/#href').extract()
for link in all_links:
url_join = response.urljoin(link)
yield scrapy.Request(
url = url_join,
dont_filter = True,
callback = self.select_company
)
Everything works fine, but as I said it does seem excessive to crawl through each href on the page. I wrote a script for this website in Selenium, and was able to select the correct hrefs by using the select_by_partial_link_text() method. I've searched for something comparable to that in scrapy, but it seems like scrapy navigation is based strickly on xpath and css name.
Is this how Scrapy is meant to be used in this scenario? Is there anything I can do to make the scraping process less redundant?
This is my first working scrapy spider, so go easy on me!
If you need to extract only links with certain substring in link text, you can use LinkExtractor with following XPath:
LinkExtractor(restrict_xpaths='//a[contains(text(), "substring to find")]').extract_links(response)
as LinkExtractor is the proper way to extract and process links in Scrapy.
Docs: https://doc.scrapy.org/en/latest/topics/link-extractors.html
I want to ask how about (do crawling) clicking next button(change number page of website) (then do crawling more till the end of page number) from this site
I've try to combining scrape with selenium,but its still error and says "line 22
self.driver = webdriver.Firefox()
^
IndentationError: expected an indented block"
I don't know why it happens, i think i code is so well.Anybody can resolve this problem?
This my source :
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem
class MySpider(BaseSpider):
name = "nowhere"
allowed_domains = ["n0where.net"]
start_urls = ["https://n0where.net/"]
def parse(self, response):
for article in response.css('.loop-panel'):
item = NowItem()
item['title'] = article.css('.article-title::text').extract_first()
item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
item['body'] ='' .join(article.css('.excerpt p::text').extract()).strip()
#item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
yield item
def __init__(self):
self.driver = webdriver.Firefox()
def parse2(self, response):
self.driver.get(response.url)
while True:
next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
try:
next.click()
# get the data and write it to scrapy items
except:
break
self.driver.close()`
This my capture of my program mate :
Ignoring the syntax and indentation errors you have an issue with your code logic in general.
What you do is create webdriver and never use it. What your spider does here is:
Create webdriver object.
Schedule a request for every url in self.start_urls, in your case it's only one.
Download it, make Response object and pass it to the self.parse()
Your parse method seems to find some xpaths and makes some items, so scrapy yields you some items that were found if any
Done
Your parse2 was never called and so your selenium webdriver was never used.
Since you are not using scrapy to download anything in this case you can just override start_requests()(<- that's where your spider starts) method of your spider to do the whole logic.
Something like:
from selenium import webdriver
import scrapy
from scrapy import Selector
class MySpider(scrapy.Spider):
name = "nowhere"
allowed_domains = ["n0where.net"]
start_url = "https://n0where.net/"
def start_requests(self):
driver = webdriver.Firefox()
driver.get(self.start_url)
while True:
next_url = driver.find_element_by_xpath(
'/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
try:
# parse the body your webdriver has
self.parse(driver.page_source)
# click the button to go to next page
next_url.click()
except:
break
driver.close()
def parse(self, body):
# create Selector from html string
sel = Selector(text=body)
# parse it
for article in sel.css('.loop-panel'):
item = dict()
item['title'] = article.css('.article-title::text').extract_first()
item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
# item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
yield item
This is a indentation error. Look the lines near the error:
def parse2(self, response):
self.driver.get(response.url)
The first of these two lines ends with a colon. So, the second line should be more indented than the first one.
There are two possible fixes, depending on what you want to do. Either add an indentation level to the second one:
def parse2(self, response):
self.driver.get(response.url)
Or move the parse2 function out of theinit` function:
def parse2(self, response):
self.driver.get(response.url)
def __init__(self):
self.driver = webdriver.Firefox()
# etc.
I am scraping data using Scrapy in a item.json file. Data is getting stored but the problem is only 25 entries are stored, while in the website there are more entries. I am using the following command:
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["justdial.com"]
start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
items.append(item)
return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there any setting which I am not aware of? or the page is not getting loaded fully till scraping. how do i resolve this?
Abhi, here is some code, but please note that it isn't complete and working, it is just to show you the idea. Usually you have to find a next page URL and try to recreate the appropriate request in your spider. In your case AJAX is used. I used FireBug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
I have a very basic scrapy spider, which grabs urls from the file and then downloads them. The only problem is that some of them got redirected to a slightly modified url within same domain. I want to get them in my callback function using response.meta, and it works on a normal urls, but then url is redirected callback doesn't seem to get called. How can I fix it?
Here's my code.
from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request
class DmozSpider(CrawlSpider):
name = "dmoz"
handle_httpstatus_list = [302]
allowed_domains = ["http://www.exmaple.net/"])
f = open("C:\\python27\\1a.csv",'r')
url = 'http://www.exmaple.net/Query?indx='
start_urls = [url+row for row in f.readlines()]
def parse(self, response):
print response.meta.get('redirect_urls', [response.url])
print response.status
print (response.headers.get('Location'))
I've also tried something like that:
def parse(self, response):
return Request(response.url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_my_url)
def parse_my_url(self, response):
print response.status
print (response.headers.get('Location'))
And it doesn't work either.
By default scrapy requests are redirected, although if you don't want to redirect you can do like this, use start_requests method and add flags in request meta.
def start_requests(self):
requests =[(Request(self.url+u, meta={'handle_httpstatus_list': [302],
'dont_redirect': True},
callback=self.parse)) for u in self.start_urls]
return requests