I am trying to scrape data from a page which has a lot of AJAX calls and JavaScript execution to render the webpage, so I am trying to use Scrapy with Selenium to do this. The modus operandi is as follows:

1. Add the login page URL to the Scrapy start_urls list.
2. Use the FormRequest.from_response method to post the username and password to get authenticated.
3. Once logged in, request the desired page to be scraped.
4. Pass this response to the Selenium WebDriver to click buttons on the page.
5. Once the buttons are clicked and a new webpage is rendered, capture the result.
The code that I have thus far is as follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from selenium import webdriver
import time

class LoginSpider(BaseSpider):
    name = "sel_spid"
    start_urls = ["http://www.example.com/login.aspx"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        return FormRequest.from_response(response,
                                         formdata={'User': 'username', 'Pass': 'password'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if "Log Out" in response.body:
            self.log("Successfully logged in")
            scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
            yield Request(url=scrape_url, callback=self.parse_page)
        else:
            self.log("Bad credentials")

    def parse_page(self, response):
        self.driver.get(response.url)
        next = self.driver.find_element_by_class_name('dxWeb_pNext')
        next.click()
        time.sleep(2)
        # capture the html and store in a file
The two roadblocks I have hit so far are:

1. Step 4 does not work. Whenever Selenium opens the Firefox window, it is always at the login screen and does not know how to get past it.
2. I don't know how to achieve step 5.

Any help will be greatly appreciated.
I don't believe you can switch between Scrapy requests and Selenium like that. You need to log into the site using Selenium, not by yielding a Request(). The login session you created with Scrapy is not transferred to the Selenium session. Here is an example (the element IDs/XPaths will be different for you):
scrape_url = "http://www.example.com/authen_handler.aspx"
self.driver.get(scrape_url)
time.sleep(2)
username = self.driver.find_element_by_id("User")
password = self.driver.find_element_by_name("Pass")
username.send_keys("your_username")
password.send_keys("your_password")
self.driver.find_element_by_xpath("//input[@name='commit']").click()
Then you can do:
time.sleep(2)
self.driver.find_element_by_class_name('dxWeb_pNext').click()
time.sleep(2)
etc.
EDIT: If you need to render JavaScript and are worried about speed/non-blocking, you can use Splash (http://splash.readthedocs.org/en/latest/index.html), which should do the trick.
http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on setting a cookie; you should be able to pass it from Scrapy, but I have not done it before.
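For what it's worth, here is a minimal sketch of that idea, assuming the scrapy-splash plugin and a running Splash instance; the cookie name ASP.NET_SessionId and the domain are placeholders, not taken from the question:

from scrapy_splash import SplashRequest

# Lua script run by Splash: set the session cookie, load the page,
# wait briefly for the JavaScript to render, then return the HTML.
lua_script = """
function main(splash)
    splash:add_cookie{"ASP.NET_SessionId", splash.args.session_value,
                      path="/", domain="www.example.com"}
    assert(splash:go(splash.args.url))
    assert(splash:wait(2))
    return splash:html()
end
"""

def request_with_session(self, url, session_value):
    # A spider method sketch; session_value would be extracted from
    # Scrapy's response headers after the FormRequest login.
    return SplashRequest(url,
                         callback=self.parse_page,
                         endpoint='execute',
                         args={'lua_source': lua_script,
                               'session_value': session_value})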
Log in with the Scrapy API first:
# call scrapy post request with browse_files as callback
return FormRequest.from_response(
    response,
    # formxpath=formxpath,
    formdata=formdata,
    callback=self.browse_files
)
Then pass the session to the Selenium Chrome driver:
# logged in previously with scrapy api
def browse_files(self, response):
    print "browse files for: %s" % (response.url)

    # response.headers
    cookie_list2 = response.headers.getlist('Set-Cookie')
    print cookie_list2

    self.driver.get(response.url)
    self.driver.delete_all_cookies()

    # extract all the cookies
    for cookie2 in cookie_list2:
        cookies = map(lambda e: e.strip(), cookie2.split(";"))
        for cookie in cookies:
            splitted = cookie.split("=")
            if len(splitted) == 2:
                name = splitted[0]
                value = splitted[1]
                # for my particular use case I needed only these values
                if name == 'csrftoken' or name == 'sessionid':
                    cookie_map = {"name": name, "value": value}
                else:
                    continue
            elif len(splitted) == 1:
                cookie_map = {"name": splitted[0], "value": ''}
            else:
                continue

            print "adding cookie"
            print cookie_map
            self.driver.add_cookie(cookie_map)

    self.driver.get(response.url)

    # check if we have successfully logged in
    files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
    print files
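wait_for_elements_to_be_present is a custom helper that the answer does not include; here is a minimal sketch of what it could look like, built on Selenium's explicit waits (the ten-second timeout is an arbitrary choice):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_elements_to_be_present(self, by, selector, response, timeout=10):
    # Block until at least one matching element is attached to the DOM,
    # returning an empty list instead of raising if the wait times out.
    try:
        return WebDriverWait(self.driver, timeout).until(
            EC.presence_of_all_elements_located((by, selector))
        )
    except TimeoutException:
        self.log("timed out waiting for %s on %s" % (selector, response.url))
        return []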
Related
TL;DR
In scrapy, I want the Request to wait till all spider parse callbacks finish. So the whole process needs to be sequential. Like this:
Request1 -> Crawl1 -> Request2 -> Crawl2 ...
But what is happening now:
Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
Long version
I am new to scrapy + selenium web scraping.
I am trying to scrape a website whose contents are updated heavily with JavaScript. First I open the website with Selenium and log in. After that, I use a downloader middleware that handles the requests with Selenium and returns the responses. Below is the middleware's process_request implementation:
class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one.'''

    def process_request(self, request, spider):
        driver = request.meta['driver']

        # We are opening a new link.
        if request.meta['load_url']:
            driver.get(request.url)
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        # We are clicking on an element to get new data using javascript.
        elif request.meta['click_bet']:
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
In settings, I have also set CONCURRENT_REQUESTS = 1 so that multiple driver.get() calls are not made and Selenium can load the responses peacefully, one by one.
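For completeness, here is a sketch of the corresponding settings.py entries; the module path myproject.middlewares is a placeholder, not the real project path:

# settings.py (hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.XYZDownloaderMiddleware': 543,
}

# One request at a time so the single Selenium driver is never shared
# between concurrent downloads.
CONCURRENT_REQUESTS = 1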
Now what I see happening is: Selenium opens each URL, Scrapy lets Selenium wait for the response to finish loading, and then the middleware returns the response properly (it goes into the if request.meta['load_url'] block).
But after I get the response, I want to use the Selenium driver (in the parse(response) functions) to click on each of the elements by yielding a Request and have the middleware return the updated HTML (the elif request.meta['click_bet'] block).
The Spider is minimally like this:
class XYZSpider(scrapy.Spider):
    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]

        self.driver = self.getSeleniumDriver()

        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()

        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...
So what is happening is that Scrapy is not waiting for the spider to parse. Scrapy gets the response and then calls the parse callback and fetches the next response in parallel. But the Selenium driver needs to be used by the parse callback before the next request is processed.
I want the requests to wait until the parse callback is finished.
I am trying to scrape a table that comes after a login page using scrapy. The Login page is http://subscribers.footballguys.com/amember/login.php, and the webpage I am trying to scrape is https://subscribers.footballguys.com/myfbg/myweeklycheatsheet.php.
I have tried to follow the tutorials from scrapy's documentation as well as here, but I am not getting any responses back (not even the hello world). Below is my code. I can also provide any other information needed. Thank you in advance!
import scrapy

class FbgQbSpider(scrapy.Spider):
    name = 'fbg_qb'
    allowed_domains = ['www.footballguys.com/']
    start_urls = ['http://subscribers.footballguys.com/amember/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'amember_login': 'example@gmail.com', 'amember_pass': 'examplepassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login success before going on
        View(response)
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        fetch("https://subscribers.footballguys.com/myfbg/myweeklycheatsheet.php")
        players = response.css("span::text").extract()
        for item in zip(players):
            scraped_info = {
                'player': item[0]
            }
            yield scraped_info
        print("hello world")
"hello world" is not printing because of an indentation issue: the print is indented inside the after_login callback, so it only executes if that callback is ever reached; it is not at a level where it would print as soon as the spider runs.
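The answer does not show corrected code; here is a hedged sketch of one way to act on it, moving the diagnostic somewhere that is guaranteed to execute and routing it through the spider's logger so it shows up in Scrapy's output:

    def parse(self, response):
        self.logger.info("hello world")  # runs for every response to a start URL
        return scrapy.FormRequest.from_response(
            response,
            formdata={'amember_login': 'example@gmail.com', 'amember_pass': 'examplepassword'},
            callback=self.after_login
        )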
I am trying to scrape the details from a hotel listing site (dineout.co.in).
When we click the next button for the next page, the URL remains the same, and inspecting the page shows the site sending an XHR request. I tried to use Selenium WebDriver with Python, and the following is my code:
from time import sleep
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException

class DineoutRestaurantSpider(scrapy.Spider):
    name = 'dineout_restaurant'
    allowed_domains = ['dineout.co.in/bangalore-restaurants?search_str=']
    start_urls = ['http://dineout.co.in/bangalore-restaurants?search_str=']

    def start_requests(self):
        self.driver = webdriver.Chrome('/Users/macbookpro/Downloads/chromedriver')
        self.driver.get('https://www.dineout.co.in/bangalore-restaurants?search_str=')
        url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
        yield Request(url, callback=self.parse)
        self.logger.info('Empty message')
        for i in range(1, 4):
            try:
                next_page = self.driver.find_element_by_xpath('//a[text()="Next "]')
                sleep(11)
                self.logger.info('Sleeping for 11 seconds.')
                next_page.click()
                url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
                yield Request(url, callback=self.parse)
            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse(self, response):
        self.logger.info('Entered parse method')
        restaurants = response.xpath('//*[@class="cardBg"]')
        for restaurant in restaurants:
            name = restaurant.xpath('.//*[@class="titleDiv"]/h4/a/text()').extract_first()
            location = restaurant.xpath('.//*[@class="location"]/a/text()').extract()
            rating = restaurant.xpath('.//*[@class="rating rating-5"]/a/span/text()').extract_first()
            yield {
                'Name': name,
                'Location': location,
                'Rating': rating,
            }
In the above code, the yielded Request does not go to the parse function. Am I missing anything? I am not getting any error, but the scraped output is only from the first page even though the pages are being iterated.
Complete Python newb here so I may be asking something painfully obvious, but I've searched through this site, the Scrapy docs, and Google and I'm completely stuck on this problem.
Essentially, I want to use Scrapy's FormRequest to log me in to a site so that I can scrape and save some stats from various pages. The issue is that the response I receive from the site after submitting the form just returns me to the home page (without any login error notifications in the response body). I'm not sure how I am botching this log-in process. Although it is a pop-up login form, I don't think that should be an issue since using Firebug, I can extract the relevant html code (and xpath) for the form embedded in the webpage.
Thanks for any help. The code is pasted below (I replaced my actual username and password):
# -*- coding: utf-8 -*-
import scrapy

class dkspider(scrapy.Spider):
    name = "dkspider"
    allowed_domains = ["draftkings.com"]
    start_urls = ['https://www.draftkings.com/contest-lobby']

    def parse(self, response):
        return scrapy.http.FormRequest.from_response(response,
                                                     formxpath='//*[@id="login_form"]',
                                                     formdata={'username': 'myusername', 'password': 'mypass'},
                                                     callback=self.started)

    def started(self, response):
        filename = 'attempt1.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        if 'failed' in response.body:
            print 'Errors!'
        else:
            print 'Success'
Seems like your parameters don't match (it should be login instead of username) and you are missing some of them in your formdata. Firebug shows that login, password, profillingSessionId, returnUrl and layoutType are delivered when trying to log in.
Seems like layoutType and returnUrl can just be hardcoded in, but profillingSessionId needs to be retrieved from the page source. I checked the source and found a hidden input with id tmxSessionId that carries it. So your spider should look something like this:
def parse(self, response):
    return FormRequest(
        url='https://www.draftkings.com/account/login',
        formdata={'login': 'login',  # login instead of username
                  'password': 'password',
                  'profillingSessionId': ''.join(
                      response.xpath("//input[@id='tmxSessionId']/@value").extract()),
                  'returnUrl': '',
                  'layoutType': '2'},
        callback=self.started)

def started(self, response):
    # Reload the landing page
    return Request(self.start_urls[0], self.logged_in)

def logged_in(self, response):
    # logged in page here
    pass
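For completeness, here is a hedged sketch of what logged_in could check; the "Sign Out" marker is only an assumption about the page, mirroring the "Log Out" test used in the first question above:

def logged_in(self, response):
    # Crude success check on the reloaded landing page (marker text is a guess).
    if 'Sign Out' in response.body:
        self.log("logged in successfully")
    else:
        self.log("login appears to have failed")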
I tried to use Scrapy to complete the login and collect my project commit count. Here is the code:
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import Spider
from scrapy.utils.response import open_in_browser

class GitSpider(Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def parse(self, response):
        formdata = {'login': 'username',
                    'password': 'password'}
        yield FormRequest.from_response(response,
                                        formdata=formdata,
                                        clickdata={'name': 'commit'},
                                        callback=self.parse1)

    def parse1(self, response):
        open_in_browser(response)
After running the code
scrapy runspider github.py
it should show me the result page of the form, which should be a failed login on the same page, since the username and password are fake. However, it shows me the search page. The log file is located in pastebin.
How should the code be fixed? Thanks in advance.
Your problem is that FormRequest.from_response() uses a different form, the "search" form, but you wanted it to use the "log in" form instead. Provide a formnumber argument:
yield FormRequest.from_response(response,
                                formnumber=1,
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)
After applying the change (and using a "fake" user), the browser opens the expected failed-login page rather than the search page.
Solution using webdriver.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.contrib.spiders import CrawlSpider

class GitSpider(CrawlSpider):
    name = "gitscrape"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        login_form = self.driver.find_element_by_name('login')
        password_form = self.driver.find_element_by_name('password')
        commit = self.driver.find_element_by_name('commit')
        login_form.send_keys("yourlogin")
        password_form.send_keys("yourpassword")
        actions = ActionChains(self.driver)
        actions.click(commit)
        actions.perform()
        # by this point you are logged in to github and have access
        # to all data in the main menu
        time.sleep(3)
        self.driver.close()
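Building on that, and since the question was about collecting a commit count, the rendered page could be handed back to Scrapy's selectors before closing the driver; here is a hedged sketch, where the XPath is purely illustrative and not a real GitHub selector:

from scrapy.selector import Selector

# Inside parse(), after actions.perform() and the sleep, before self.driver.close():
sel = Selector(text=self.driver.page_source)
# Hypothetical selector for a commit counter; adjust it to the real page markup.
commit_count = sel.xpath('//span[contains(@class, "commits")]/text()').extract_first()
self.log("commit count: %s" % commit_count)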
Using the "formname" argument also works:
yield FormRequest.from_response(response,
                                formname='Login',
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)