Scraping website with credentials using scrapy issue with FORMDATA - python

Hope everyone is safe and sound,
I am currently learning Scrapy and decided to try scraping a website (Glassdoor) that requires a login.
I am stuck and wonder if anyone could check what I have done so far and give me a hand?
1) I loaded the Glassdoor login page and opened the Inspect tool (in Chrome).
2) I selected the Network tab and entered my credentials on the page. Once logged in, I looked for the login_input.htm request with a 302 status (POST). After selecting it I went to the Headers section, but I cannot find the Form Data section, so I do not have all the information I need to add to my code.
I tried a lot of online resources but cannot find a solution to this.
I also placed the code I started to work with:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
class GdSpider(scrapy.Spider):
    name = 'gd'
    allowed_domains = ['https://www.glassdoor.co.uk/profile/login_input.htm']
    start_urls = ('http://https://www.glassdoor.co.uk/profile/login_input.htm/',)
    def parse(self, response):
        return FormRequest.from_response(response,
                                         formdata={'password': 'mypassword',
                                                   'username': 'myusername'},
                                         callback=self.scrape_pages)
    def scrape_pages(self, response):
        open_in_browser(response)
Could anyone let me know what I did wrong please?
Thank you,
Arnaud

Glassdoor's login is a JavaScript-rendered popup. If you disable JS, you will see that nothing renders when you click the Sign In link or open the link you have given.
This seems to be what you are looking for:
https://www.glassdoor.com/profile/ajax/loginAjax.htm
When you open the Sign In popup and try to log in using any credentials (they can be wrong, it does not matter), you will see loginAjax.htm appear in the Network tab. This request sends the credentials by POST to the URL I posted above.
Unfortunately, it also sends a token along with the credentials, so using this to log in might prove difficult.
For sending data you can use _urlencode (from scrapy.http.request.form import _urlencode) like this:
inputs = [("key", "value"),]
body = _urlencode(inputs, response.encoding)
Then send the body via POST to the above URL by building a normal Scrapy Request (inputs has to be a list of tuples).
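As a rough sketch of the encoding step, here is the same idea using the standard library's urllib.parse.urlencode as a stand-in for Scrapy's private _urlencode helper. The field names and the token value are placeholders, not Glassdoor's real ones; those would have to be read from the loginAjax.htm request in the Network tab.

```python
from urllib.parse import urlencode

# Hypothetical field names and values (assumptions for illustration only).
inputs = [
    ("username", "myusername"),
    ("password", "my password"),
    ("token", "abc123"),
]

# urlencode accepts a list of (key, value) tuples and percent-encodes them
# into an application/x-www-form-urlencoded string.
body = urlencode(inputs)
print(body)

# With Scrapy installed, the body could then be sent roughly like this:
#   scrapy.Request(url, method="POST", body=body,
#                  headers={"Content-Type": "application/x-www-form-urlencoded"},
#                  callback=self.after_login)
```

Because the token most likely changes between sessions, it would need to be fetched fresh before each login attempt rather than hard-coded.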

Related

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a Python script using Scrapy to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is built mainly with JavaScript and AJAX requests. The whole body of the page sits inside a <form> that changes the page via a submit button. The URL doesn't change (and it's an .aspx page).
I have successfully scraped all the data I need from page one, and I then move to the next page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need a loop that clicks on each container, displays the data, and scrapes it; only after that should I go to the next page and repeat the same process.
The thing is, I can't work out how to set this up so that Selenium clicks on the container (while logged in; otherwise I cannot reach this page) and Scrapy then scrapes the data once the XHR request has completed.
I did a lot of research on the internet but could not find a solution that I could try.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code to get the AJAX response:
    yield scrapy.FormRequest.from_response(
        response=response,
        formdata={
            'param1': param1value,
            'param2': param2value,
            '__VIEWSTATE': __VIEWSTATE,
            '__ASYNCPOST': 'true',
            'DetailsId': '123'},
        callback=self.parse_item)
def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see in the Response tab of Chrome DevTools. I have not included all the form data yet (about 10 of 25 fields). Could it be that it needs all the form data, even the fields that don't change with the id?
Thanks !
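One way to check whether the missing fields matter is to compare the field names the browser sent with the ones the spider sends. A minimal sketch, assuming the raw request body has been copied from the DevTools Network tab as a URL-encoded string; the helper name and the sample strings below are made up for illustration:

```python
from urllib.parse import parse_qsl


def missing_fields(browser_body, spider_body):
    """Return field names present in the browser's POST body but not in ours."""
    # keep_blank_values=True so empty fields such as __EVENTTARGET= are kept.
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    spider = dict(parse_qsl(spider_body, keep_blank_values=True))
    return sorted(set(browser) - set(spider))


# Made-up sample bodies standing in for the real requests.
browser_body = "__VIEWSTATE=abc&__EVENTTARGET=&__ASYNCPOST=true&DetailsId=123&param1=x"
spider_body = "__VIEWSTATE=abc&__ASYNCPOST=true&DetailsId=123"
print(missing_fields(browser_body, spider_body))
```

FormRequest.from_response already copies the fields it finds in the HTML form, hidden ones included, so any names reported here are most likely fields that the page's JavaScript adds at submit time.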

Scrapy login failed

I am trying to log in to a website using Scrapy. I have successfully done this for other websites but I seem to be having trouble this time and I'm not sure why.
Attached are screenshots of the response I get when running this code and an inspection of the page I'm trying to log in to.
import scrapy
class iauditorSpider(scrapy.Spider):
    name = "iauditor"
    start_urls = ['https://app.safetyculture.io/login.html']
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formid='login-form',
            formdata={'email': 'example@email.com',
                      'password': 'secret'},
            callback=self.after_login
        )
    def after_login(self, response):
        # Check login success before continuing
        if b'Those details don\'t appear to be correct.' in response.body:
            self.logger.error("Login Failed.")
            return
My Response,
Page Inspect
I've successfully logged in to other websites before with almost identical code so I am confused why it's not working this time.
Quick guess: you must send headers and cookies to perform the login.
1) Go to the login page.
2) Open Developer Tools and go to the Network tab.
3) Click on Preserve Logs or Persist to make sure logs are kept when you are redirected to another page.
4) Now log in to that site, and notice what request is sent when you click on the login button.
5) Right-click on that request and click Copy as cURL (bash).
6) Go to https://curl.trillworks.com/ and paste your cURL command there.
That is it; now you have Python code that reproduces the login request.

How to use requests to login to this website

I'm trying to automate some tasks with Python and web scraping, but first I need to log in to a website I have an account on.
I've seen several examples on Stack Overflow, but for some reason this website won't let me log in using requests. Can anyone tell me what I'm doing wrong?
The webpage:
https://www.americanbulls.com/Signin.aspx?lang=en
the form variables:
ctl00$MainContent$uEmail
ctl00$MainContent$uPassword
Is it because the variable names have '$' in them?
Any help would be greatly appreciated.
import sys
print(sys.path)
# Raw string so the backslashes in the Windows path are not treated as escapes
sys.path.append(r'C:\program files\python36\lib\site-packages\pip\_vendor')
import requests
import time
EMAIL = '<my_email>'
PASSWORD = '<my_password>'
URL = 'https://www.americanbulls.com/Signin.aspx?lang=en'
# Start a session so we can have persistent cookies
session = requests.session()
# This is the form data that the page sends when logging in
login_data = {
    'ctl00$MainContent$uEmail': EMAIL,
    'ctl00$MainContent$uPassword': PASSWORD
}
# Authenticate
r = session.post(URL, data=login_data, timeout=15, verify=True)
# Try accessing a page that requires you to be logged in
r = session.get('https://www.americanbulls.com/members/SignalPage.aspx?lang=en&Ticker=SQ')
print(r.url)
I submitted the form using test@test.test as the email and test as the password. When I looked at the headers of that request in the Network tab of Chrome DevTools, it showed that the following form data had been submitted:
ctl00$ScriptManager1:ctl00$MainContent$UpdatePanel|ctl00$MainContent$btnSubmit
__LASTFOCUS:
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTE5MzMzODAyNzIPZBYCZg9kFgICAQ9kFgICAw9kFgICBQ9kFhICAQ8WAh4FY2xhc3MFFmhlYWRlcmNvbnRhaW5lcl9zYWZhcmkWCgIBDzwrAAkCAA8WAh4OXyFVc2VWaWV3U3RhdGVnZAYPZBAWAmYCARYCPCsADAEAFgYeC05hdmlnYXRlVXJsBRVSZWdpc3Rlci5hc3B4P2xhbmc9ZW4eBFRleHQFCFJlZ2lzdGVyHgdUb29sVGlwBTFSZWdpc3RlciBub3cgdG8gZ2V0IGFjY2VzcyB0byBleGNsdXNpdmUgZmVhdHVyZXMhPCsADAEAFgYfAwUHU2lnbiBJbh8CBRNTaWduaW4uYXNweD9sYW5nPWVuHghTZWxlY3RlZGdkZAIDDw8WBB8CBRREZWZhdWx0LmFzcHg/bGFuZz1lbh4ISW1hZ2VVcmwFGH4vaW1nL2FtZXJpY2FuYnVsbHMxLmdpZmRkAgcPZBYCAgEPPCsACQIADxYCHwFnZAYPZBAWAWYWATwrAAwCABYCHwMFB0VuZ2xpc2gBD2QQFghmAgECAgIDAgQCBQIGAgcWCDwrAAwCABYGHwMFB0VuZ2xpc2gfAgUUL1NpZ25pbi5hc3B4P2xhbmc9ZW4fBWcCFCsAAhYCHgNVcmwFEn4vaW1nL2VuaWNvbjAxLnBuZ2Q8KwAMAgAWBB8DBQdEZXV0c2NoHwIFFC9TaWduaW4uYXNweD9sYW5nPWRlAhQrAAIWAh8HBRJ+L2ltZy9kZWljb24wMS5wbmdkPCsADAIAFgQfAwUG5Lit5paHHwIFFC9TaWduaW4uYXNweD9sYW5nPXpoAhQrAAIWAh8HBRJ+L2ltZy96aGljb24wMS5wbmdkPCsADAIAFgQfAwUJRnJhbsOnYWlzHwIFFC9TaWduaW4uYXNweD9sYW5nPWZyAhQrAAIWAh8HBRJ+L2ltZy9mcmljb24wMS5wbmdkPCsADAIAFgQfAwUIVMO8cmvDp2UfAgUUL1NpZ25pbi5hc3B4P2xhbmc9dHICFCsAAhYCHwcFEn4vaW1nL3RyaWNvbjAxLnBuZ2Q8KwAMAgAWBB8DBQlJbmRvbmVzaWEfAgUUL1NpZ25pbi5hc3B4P2xhbmc9aWQCFCsAAhYCHwcFEn4vaW1nL2lkaWNvbjAxLnBuZ2Q8KwAMAgAWBB8DBQhFc3Bhw7FvbB8CBRQvU2lnbmluLmFzcHg/bGFuZz1lcwIUKwACFgIfBwUSfi9pbWcvZXNpY29uMDEucG5nZDwrAAwCABYEHwMFCEl0YWxpYW5vHwIFFC9TaWduaW4uYXNweD9sYW5nPWl0AhQrAAIWAh8HBRJ+L2ltZy9pdGljb24wMS5wbmdkZGRkAgkPZBYCAgEPPCsABAEADxYEHgVWYWx1ZQULTGFzdCBVcGRhdGUeB1Zpc2libGVoZGQCCw9kFgICAQ88KwAEAQAPFgIfCWhkZAIDDxYCHwAFGG1haW5tZW51Y29udGFpbmVyX3NhZmFyaRYGAgEPPCsACQIADxYCHwFnZAYPZBAWDWYCAQICAgMCBAIFAgYCBwIIAgkCCgILAgwWDTwrAAwBABYEHwIFFERlZmF1bHQuYXNweD9sYW5nPWVuHwMFBEhPTUU8KwAMAQAWBB8DBQRBTUVYHwIFKVNpZ25hbExpc3QuYXNweD9sYW5nPWVuJk1hcmtldFN5bWJvbD1BTUVYPCsADAEAFgQfAwUETllTRR8CBSlTaWduYWxMaXN0LmFzcHg/bGFuZz1lbiZNYXJrZXRTeW1ib2w9TllTRTwrAAwBABYEHwMFBk5BU0RBUR8CBStTaWduYWxMaXN0LmFzcHg/bGFuZz1lbiZNYXJrZXRTeW1ib2w9TkFTREFRPCsADAEAFgQfAwUIT1RDIFBJTksfAgUpU2lnbmFsTGlzdC5hc3B4P2xhbmc9ZW4mTWFya2V0U3ltYm9sPVBJTks8KwAM
AQAWBB8DBQlQUkVGRVJSRUQfAgUuU2lnbmFsTGlzdC5hc3B4P2xhbmc9ZW4mTWFya2V0U3ltYm9sPVBSRUZFUlJFRDwrAAwBABYEHwMFCFdBUlJBTlRTHwIFLVNpZ25hbExpc3QuYXNweD9sYW5nPWVuJk1hcmtldFN5bWJvbD1XQVJSQU5UUzwrAAwBABYEHwMFB0lOREVYRVMfAgUcSW5kZXhTaWduYWxMaXN0LmFzcHg/bGFuZz1lbjwrAAwCABYEHwMFAmZ4HwIFGVNpZ25hbExpc3RGWC5hc3B4P2xhbmc9ZW4KPCsADgEAFgYeCUZvcmVDb2xvcgpgHgtGb250X0l0YWxpY2ceBF8hU0IChCA8KwAMAQAWAh8JaDwrAAwBABYCHwloPCsADAEAFgIfCWg8KwAMAQAWAh8JaGRkAgMPFCsABA8WBB8IBRRTdXBwb3J0LmFzcHg/bGFuZz1lbh8JaGRkZDwrAAUBABYCHwMFBEhlbHBkAgUPZBYCAgMPPCsABAEADxYCHwgFJmh0dHBzOi8vd3d3LnR3aXR0ZXIuY29tL2FtZXJpY2FuX0J1bGxzZGQCBQ8WAh8ABRdzdWJtZW51Y29udGFpbmVyX3NhZmFyaRYKAgEPPCsACQIADxYCHwFnZAYPZBAWAWYWATwrAAwBABYEHwIFFVJlZ2lzdGVyLmFzcHg/bGFuZz1lbh8DBTFSZWdpc3RlciBub3cgdG8gZ2V0IGFjY2VzcyB0byBleGNsdXNpdmUgZmVhdHVyZXMhZGQCAw88KwAJAgAPFgQfAWcfCWhkBg9kEBYBZhYBPCsADAEAFgQfAgUTU2lnbmluLmFzcHg/bGFuZz1lbh8DBQdTaWduIEluZGQCBQ88KwAJAgAPFgIfAWdkBg9kEBYBZhYBPCsADAEAFgQfAgUfTWVtYmVyc2hpcEJlbmVmaXRzLmFzcHg/bGFuZz1lbh8DBRNNZW1iZXJzaGlwIEJlbmVmaXRzZGQCBw88KwAJAQAPFgQfAWcfCWhkZAILDzwrAAYBAzwrAAgBABYCHghOdWxsVGV4dAUMRW50ZXIgU3ltYm9sZAIHDxYCHwAFEGNvbnRhaW5lcl9zYWZhcmkWAgIBD2QWAgIBD2QWAgIBD2QWAgIDD2QWAmYPZBYcAgEPPCsABAEADxYCHwgFB1NpZ24gSW5kZAIDDzwrAAQBAA8WAh8IBQlOZXcgVXNlcj9kZAIFDxQrAAQPFgIfCAUVUmVnaXN0ZXIuYXNweD9sYW5nPWVuZGRkPCsABQEAFgIfAwUIUmVnaXN0ZXJkAgcPPCsABAEADxYCHwhlZGQCCQ88KwAEAQAPFgIfCAUFRW1haWxkZAINDw8WAh4MRXJyb3JNZXNzYWdlBQ1JbnZhbGlkIGVtYWlsZGQCDw8PFgIfDgUNSW52YWxpZCBlbWFpbGRkAhEPPCsABAEADxYCHwgFCFBhc3N3b3JkZGQCFQ8PFgIfDgUUUGFzc3dvcmQgaXMgcmVxdWlyZWRkZAIXDzwrAAQBAA8WAh8IBQtSZW1lbWJlciBNZWRkAhsPDxYCHwMFB1NpZ24gSW5kZAIdDw8WAh8DBQZDYW5jZWxkZAIfDzwrAAQBAA8WAh8IBShJZiB5b3UgY2Fubm90IHJlYWNoIHlvdXIgYWNjb3VudCwgcGxlYXNlZGQCIQ8UKwAEDxYCHwgFGVNlbmRQYXNzd29yZC5hc3B4P2xhbmc9ZW5kZGQ8KwAFAQAWAh8DBQtjbGljayBoZXJlLmQCCQ8WAh8ABRB3aGl0ZWJhbnRfc2FmYXJpZAILDxYCHwAFG3N1cHBvcnRtZW51Y29udGFpbmVyX3NhZmFyaRYCAgEPPCsACQIADxYCHwFnZAYPZBAWBmYCAQICAgMCBAIFFgY8KwAMAQAWBB8DBQhBYm91dCBVcx8CBRRBYm91dFVzLmFzcHg/bGFuZz1lbjwrAAwBABYEHwMFB1N1cHBvcnQfAgUUU3Vw
cG9ydC5hc3B4P2xhbmc9ZW48KwAMAQAWBB8DBQdQcml2YWN5HwIFFFByaXZhY3kuYXNweD9sYW5nPWVuPCsADAEAFgQfAwUDVE9THwIFEFRvcy5hc3B4P2xhbmc9ZW48KwAMAQAWBB8DBRNNZW1iZXJzaGlwIEJlbmVmaXRzHwIFH01lbWJlcnNoaXBCZW5lZml0cy5hc3B4P2xhbmc9ZW48KwAMAQAWBB8DBQ9JbXBvcnRhbnQgTGlua3MfAgUbSW1wb3J0YW50TGlua3MuYXNweD9sYW5nPWVuZGQCDQ8WAh8ABRdmb290ZXJjb250YWluZXIxX3NhZmFyaRYCAgEPPCsACQIADxYCHwFnZAYPZBAWCGYCAQICAgMCBAIFAgYCBxYIPCsADAEAFgYfAwUHRW5nbGlzaB8FZx8CBRZ+Ly9TaWduaW4uYXNweD9sYW5nPWVuPCsADAEAFgQfAwUHRGV1dHNjaB8CBRZ+Ly9TaWduaW4uYXNweD9sYW5nPWRlPCsADAEAFgQfAwUG5Lit5paHHwIFFn4vL1NpZ25pbi5hc3B4P2xhbmc9emg8KwAMAQAWBB8DBQlGcmFuw6dhaXMfAgUWfi8vU2lnbmluLmFzcHg/bGFuZz1mcjwrAAwBABYEHwMFCFTDvHJrw6dlHwIFFn4vL1NpZ25pbi5hc3B4P2xhbmc9dHI8KwAMAQAWBB8DBQlJbmRvbmVzaWEfAgUWfi8vU2lnbmluLmFzcHg/bGFuZz1pZDwrAAwBABYEHwMFCEVzcGHDsW9sHwIFFn4vL1NpZ25pbi5hc3B4P2xhbmc9ZXM8KwAMAQAWBB8DBQhJdGFsaWFubx8CBRZ+Ly9TaWduaW4uYXNweD9sYW5nPWl0ZGQCDw8WAh8ABRdmb290ZXJjb250YWluZXIzX3NhZmFyaRYOAgEPFgIeCWlubmVyaHRtbAUMRGlzY2xhaW1lcnM6ZAIDDxYCHw8FjwVBbWVyaWNhbmJ1bGxzLmNvbSBMTEMgaXMgbm90IHJlZ2lzdGVyZWQgYXMgYW4gaW52ZXN0bWVudCBhZHZpc2VyIHdpdGggdGhlIFUuUy4gU2VjdXJpdGllcyBhbmQgRXhjaGFuZ2UgQ29tbWlzc2lvbi4gIFJhdGhlciwgQW1lcmljYW5idWxscy5jb20gTExDIHJlbGllcyB1cG9uIHRoZSDigJxwdWJsaXNoZXLigJlzIGV4Y2x1c2lvbuKAnSBmcm9tIHRoZSBkZWZpbml0aW9uIG9mIGludmVzdG1lbnQgYWR2aXNlciBhcyBwcm92aWRlZCB1bmRlciBTZWN0aW9uIDIwMihhKSgxMSkgb2YgdGhlIEludmVzdG1lbnQgQWR2aXNlcnMgQWN0IG9mIDE5NDAgYW5kIGNvcnJlc3BvbmRpbmcgc3RhdGUgc2VjdXJpdGllcyBsYXdzLiBBcyBzdWNoLCBBbWVyaWNhbmJ1bGxzLmNvbSBMTEMgZG9lcyBub3Qgb2ZmZXIgb3IgcHJvdmlkZSBwZXJzb25hbGl6ZWQgaW52ZXN0bWVudCBhZHZpY2UuIFRoaXMgc2l0ZSBhbmQgYWxsIG90aGVycyBvd25lZCBhbmQgb3BlcmF0ZWQgYnkgQW1lcmljYW5idWxscy5jb20gTExDIGFyZSBib25hIGZpZGUgcHVibGljYXRpb25zIG9mIGdlbmVyYWwgYW5kIHJlZ3VsYXIgY2lyY3VsYXRpb24gb2ZmZXJpbmcgaW1wZXJzb25hbCBpbnZlc3RtZW50LXJlbGF0ZWQgYWR2aWNlIHRvIG1lbWJlciBhbmQgL29yIHByb3NwZWN0aXZlIG1lbWJlcnMuZAIFDxYCHw8FrAJBbWVyaWNhbmJ1bGxzLmNvbSBpcyBhbiBpbmRlcGVuZGVudCB3ZWJzaXRlLiBBbWVyaWNhbmJ1bGxzLmNvbSBMTEMgZG9lcyBub3QgcmVjZWl2ZSBjb21wZW5zYXRp
b24gYnkgYW55IGRpcmVjdCBvciBpbmRpcmVjdCBtZWFucyBmcm9tIHRoZSBzdG9ja3MsIHNlY3VyaXRpZXMgYW5kIG90aGVyIGluc3RpdHV0aW9ucyBvciBhbnkgdW5kZXJ3cml0ZXJzIG9yIGRlYWxlcnMgYXNzb2NpYXRlZCB3aXRoIHRoZSBicm9hZGVyIG5hdGlvbmFsIG9yIGludGVybmF0aW9uYWwgZm9yZXgsIGNvbW1vZGl0eSBhbmQgc3RvY2sgbWFya2V0cy5kAgcPFgIfDwX3CFRoZXJlZm9yZSwgQW1lcmljYW5idWxscy5jb20gYW5kIEFtZXJpY2FuYnVsbHMuY29tIExMQyBpcyBleGVtcHQgZnJvbSB0aGUgZGVmaW5pdGlvbiBvZiDigJxpbnZlc3RtZW50IGFkdmlzZXLigJ0gYXMgcHJvdmlkZWQgdW5kZXIgU2VjdGlvbiAyMDIoYSkgKDExKSBvZiB0aGUgSW52ZXN0bWVudCBBZHZpc2VycyBBY3Qgb2YgMTk0MCBhbmQgY29ycmVzcG9uZGluZyBzdGF0ZSBzZWN1cml0aWVzIGxhd3MsIGFuZCBoZW5jZSByZWdpc3RyYXRpb24gYXMgc3VjaCBpcyBub3QgcmVxdWlyZWQuIFdlIGFyZSBub3QgYSByZWdpc3RlcmVkIGJyb2tlci1kZWFsZXIuIE1hdGVyaWFsIHByb3ZpZGVkIGJ5IEFtZXJpY2FuYnVsbHMuY29tIExMQyBpcyBmb3IgaW5mb3JtYXRpb25hbCBwdXJwb3NlcyBvbmx5LCBhbmQgdGhhdCBubyBtZW50aW9uIG9mIGEgcGFydGljdWxhciBzZWN1cml0eSBpbiBhbnkgb2Ygb3VyIG1hdGVyaWFscyBjb25zdGl0dXRlcyBhIHJlY29tbWVuZGF0aW9uIHRvIGJ1eSwgc2VsbCwgb3IgaG9sZCB0aGF0IG9yIGFueSBvdGhlciBzZWN1cml0eSwgb3IgdGhhdCBhbnkgcGFydGljdWxhciBzZWN1cml0eSwgcG9ydGZvbGlvIG9mIHNlY3VyaXRpZXMsIHRyYW5zYWN0aW9uIG9yIGludmVzdG1lbnQgc3RyYXRlZ3kgaXMgc3VpdGFibGUgZm9yIGFueSBzcGVjaWZpYyBwZXJzb24uIFRvIHRoZSBleHRlbnQgdGhhdCBhbnkgb2YgdGhlIGluZm9ybWF0aW9uIG9idGFpbmVkIGZyb20gQW1lcmljYW5idWxscy5jb20gTExDIG1heSBiZSBkZWVtZWQgdG8gYmUgaW52ZXN0bWVudCBvcGluaW9uLCBzdWNoIGluZm9ybWF0aW9uIGlzIGltcGVyc29uYWwgYW5kIG5vdCB0YWlsb3JlZCB0byB0aGUgaW52ZXN0bWVudCBuZWVkcyBvZiBhbnkgc3BlY2lmaWMgcGVyc29uLiBBbWVyaWNhbmJ1bGxzLmNvbSBMTEMgZG9lcyBub3QgcHJvbWlzZSwgZ3VhcmFudGVlIG9yIGltcGx5IHZlcmJhbGx5IG9yIGluIHdyaXRpbmcgdGhhdCBhbnkgaW5mb3JtYXRpb24gcHJvdmlkZWQgdGhyb3VnaCBvdXIgd2Vic2l0ZXMsIGNvbW1lbnRhcmllcywgb3IgcmVwb3J0cywgaW4gYW55IHByaW50ZWQgbWF0ZXJpYWwsIG9yIGRpc3BsYXllZCBvbiBhbnkgb2Ygb3VyIHdlYnNpdGVzLCB3aWxsIHJlc3VsdCBpbiBhIHByb2ZpdCBvciBsb3NzLmQCCQ8WAh8PBeMGR292ZXJubWVudCByZWd1bGF0aW9ucyByZXF1aXJlIGRpc2Nsb3N1cmUgb2YgdGhlIGZhY3QgdGhhdCB3aGlsZSB0aGVzZSBtZXRob2RzIG1heSBoYXZlIHdvcmtlZCBpbiB0aGUgcGFzdCwgcGFzdCByZXN1bHRzIGFyZSBub3Qg
bmVjZXNzYXJpbHkgaW5kaWNhdGl2ZSBvZiBmdXR1cmUgcmVzdWx0cy4gV2hpbGUgdGhlcmUgaXMgYSBwb3RlbnRpYWwgZm9yIHByb2ZpdHMgdGhlcmUgaXMgYWxzbyBhIHJpc2sgb2YgbG9zcy4gVGhlcmUgaXMgc3Vic3RhbnRpYWwgcmlzayBpbiBzZWN1cml0eSB0cmFkaW5nLiBMb3NzZXMgaW5jdXJyZWQgaW4gY29ubmVjdGlvbiB3aXRoIHRyYWRpbmcgc3RvY2tzIG9yIGZ1dHVyZXMgY29udHJhY3RzIGNhbiBiZSBzaWduaWZpY2FudC4gWW91IHNob3VsZCB0aGVyZWZvcmUgY2FyZWZ1bGx5IGNvbnNpZGVyIHdoZXRoZXIgc3VjaCB0cmFkaW5nIGlzIHN1aXRhYmxlIGZvciB5b3UgaW4gdGhlIGxpZ2h0IG9mIHlvdXIgZmluYW5jaWFsIGNvbmRpdGlvbiBzaW5jZSBhbGwgc3BlY3VsYXRpdmUgdHJhZGluZyBpcyBpbmhlcmVudGx5IHJpc2t5IGFuZCBzaG91bGQgb25seSBiZSB1bmRlcnRha2VuIGJ5IGluZGl2aWR1YWxzIHdpdGggYWRlcXVhdGUgcmlzayBjYXBpdGFsLiBOZWl0aGVyIEFtZXJpY2FuYnVsbHMuY29tIExMQywgbm9yIEFtZXJpY2FuYnVsbHMuY29tIG1ha2VzIGFueSBjbGFpbXMgd2hhdHNvZXZlciByZWdhcmRpbmcgcGFzdCBvciBmdXR1cmUgcGVyZm9ybWFuY2UuIEFsbCBleGFtcGxlcywgY2hhcnRzLCBoaXN0b3JpZXMsIHRhYmxlcywgY29tbWVudGFyaWVzLCBvciByZWNvbW1lbmRhdGlvbnMgYXJlIGZvciBlZHVjYXRpb25hbCBvciBpbmZvcm1hdGlvbmFsIHB1cnBvc2VzIG9ubHkuZAILDxYCHw8F3wZEaXNwbGF5ZWQgaW5mb3JtYXRpb24gaXMgYmFzZWQgb24gd2lkZWx5LWFjY2VwdGVkIG1ldGhvZHMgb2YgdGVjaG5pY2FsIGFuYWx5c2lzIGJhc2VkIG9uIGNhbmRsZXN0aWNrIHBhdHRlcm5zLiBBbGwgaW5mb3JtYXRpb24gaXMgZnJvbSBzb3VyY2VzIGRlZW1lZCB0byBiZSByZWxpYWJsZSwgYnV0IHRoZXJlIGlzIG5vIGd1YXJhbnRlZSB0byB0aGUgYWNjdXJhY3kuIExvbmctdGVybSBpbnZlc3RtZW50IHN1Y2Nlc3MgcmVsaWVzIG9uIHJlY29nbml6aW5nIHByb2JhYmlsaXRpZXMgaW4gcHJpY2UgYWN0aW9uIGZvciBwb3NzaWJsZSBmdXR1cmUgb3V0Y29tZXMsIHJhdGhlciB0aGFuIGFic29sdXRlIGNlcnRhaW50eSDigJMgcmlzayBtYW5hZ2VtZW50IGlzIGNyaXRpY2FsIGZvciBzdWNjZXNzLiBFcnJvciBhbmQgdW5jZXJ0YWludHkgYXJlIHBhcnQgb2YgYW55IGZvcm0gb2YgbWFya2V0IGFuYWx5c2lzLiBQYXN0IHBlcmZvcm1hbmNlIGlzIG5vIGd1YXJhbnRlZSBvZiBmdXR1cmUgcGVyZm9ybWFuY2UuIEludmVzdG1lbnQvIHRyYWRpbmcgY2FycmllcyBzaWduaWZpY2FudCByaXNrIG9mIGxvc3MgYW5kIHlvdSBzaG91bGQgY29uc3VsdCB5b3VyIGZpbmFuY2lhbCBwcm9mZXNzaW9uYWwgYmVmb3JlIGludmVzdGluZyBvciB0cmFkaW5nLiBZb3VyIGZpbmFuY2lhbCBhZHZpc2VyIGNhbiBnaXZlIHlvdSBzcGVjaWZpYyBmaW5hbmNpYWwgYWR2aWNlIHRoYXQgaXMgYXBwcm9wcmlhdGUgdG8geW91ciBuZWVkcywgcmlzay10
b2xlcmFuY2UsIGFuZCBmaW5hbmNpYWwgcG9zaXRpb24uIEFueSB0cmFkZXMgb3IgaGVkZ2VzIHlvdSBtYWtlIGFyZSB0YWtlbiBhdCB5b3VyIG93biByaXNrIGZvciB5b3VyIG93biBhY2NvdW50LmQCDQ8WAh8PBdsBWW91IGFncmVlIHRoYXQgQW1lcmljYW5idWxscy5jb20gYW5kIEFtZXJpY2FuYnVsbHMuY29tIExMQyBpdHMgcGFyZW50IGNvbXBhbnksIHN1YnNpZGlhcmllcywgYWZmaWxpYXRlcywgb2ZmaWNlcnMgYW5kIGVtcGxveWVlcyBzaGFsbCBub3QgYmUgbGlhYmxlIGZvciBhbnkgZGlyZWN0LCBpbmRpcmVjdCwgaW5jaWRlbnRhbCwgc3BlY2lhbCBvciBjb25zZXF1ZW50aWFsIGRhbWFnZXMuZAIRDxYCHwAFHGJvdHRvbWJhbm5lcmNvbnRhaW5lcl9zYWZhcmlkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYIBQ9jdGwwMCRMb2dpbk1lbnUFC2N0bDAwJG1NYWluBQ5jdGwwMCRNYWluTWVudQUWY3RsMDAkRnJlZVJlZ2lzdGVyTWVudQUYY3RsMDAkTWVtYmVyc2hpcEJlbmVmaXRzBRJjdGwwMCRTZWFyY2hCdXR0b24FEWN0bDAwJFN1cHBvcnRNZW51BRNjdGwwMCRMYW5ndWFnZXNNZW51NlBIALTovVw6LJEOuDXyhCTS4+M=
__VIEWSTATEGENERATOR:ECDA716A
__EVENTVALIDATION:/wEdAAVswH4c0JxRe30eXDiX0bhcXr7XOgipC8DNcjKl0sbO7fwNII+YQgXfxmh/KZz6Myr4IcjYoaGuA6R78NuEHgsNQX9+ScDGDIM47zqhQCjs5Ynd+DEUmo0/Xv9Oy6tQgLO7ip/G
ctl00$mMain:{"selectedItemIndexPath":"0i0","checkedState":""}
ctl00$MainMenu:{"selectedItemIndexPath":"","checkedState":""}
ctl00$FreeRegisterMenu:{"selectedItemIndexPath":"","checkedState":""}
ctl00$MembershipBenefits:{"selectedItemIndexPath":"","checkedState":""}
ctl00$SearchBox$State:{"rawValue":"","validationState":""}
ctl00$SearchBox:Enter Symbol
ctl00$MainContent$uEmail:test@test.test
ctl00$MainContent$uPassword:test
ctl00$MainContent$ASPxCheckBox1:I
ctl00$SupportMenu:{"selectedItemIndexPath":"","checkedState":""}
ctl00$LanguagesMenu:{"selectedItemIndexPath":"","checkedState":""}
DXScript:1_304,1_185,1_298,1_211,1_221,1_188,1_182,1_290,1_296,1_279,1_198,1_209,1_217,1_201
DXCss:1_40,1_50,1_53,1_51,1_4,1_16,1_13,0_4617,0_4621,1_14,1_17,Styles/Site.css,img/favicon.ico,https://adservice.google.com/adsid/integrator.js?domain=www.americanbulls.com,https://securepubads.g.doubleclick.net/static/3p_cookie.html
__ASYNCPOST:true
ctl00$MainContent$btnSubmit:Sign In
Your code looks fine. The script is most likely failing because you're not submitting everything that the browser would normally submit. You could continue down the path you are on, submit all of the extra form data, and hope you don't have to deal with a CSRF token (a randomly generated string that the server issues and requires you to send back), or you can do as Sidharth Shah suggested and use Selenium.
There is a Firefox extension for Selenium that lets you record your mouse and keyboard actions and, when you are done, export the result as Python. That Python code depends on the Selenium library and a Selenium Chrome/Firefox/IE driver. When you run it, a new browser window opens, controlled by the Selenium driver and your Python code. It's pretty cool: you're basically writing Python code that controls a browser window. You will have to modify the exported code a little to read the data from the page and do something with it after you're logged in, but the code for opening the browser window, navigating to the login page, filling in your credentials, submitting the form, and navigating to other pages afterwards will all be written for you.
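For the first option (submitting all of the extra form data), the hidden fields such as __VIEWSTATE and __EVENTVALIDATION can be read from the login page itself instead of being copied by hand. A minimal sketch using only the standard library's html.parser; the class and function names are hypothetical, and the sample HTML is a made-up stand-in for the real page:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_data(page_html, email, password):
    """Merge the page's hidden fields with the two credential fields."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    data = dict(collector.fields)
    data["ctl00$MainContent$uEmail"] = email
    data["ctl00$MainContent$uPassword"] = password
    return data


# Made-up sample; the real page would come from a GET of the sign-in URL.
sample_html = """
<form method="post" action="Signin.aspx?lang=en">
  <input type="hidden" name="__VIEWSTATE" value="dummy-viewstate" />
  <input type="hidden" name="__EVENTVALIDATION" value="dummy-validation" />
  <input type="text" name="ctl00$MainContent$uEmail" />
  <input type="password" name="ctl00$MainContent$uPassword" />
</form>
"""
login_data = build_login_data(sample_html, "user@example.com", "secret")
print(sorted(login_data))
```

In the question's script, page_html would be the text returned by session.get(URL), and login_data would replace the two-field dict passed to session.post. Fields that the browser adds at submit time (for example __ASYNCPOST or the submit button's own name) may still have to be added by hand.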

Login automation and crawling using Scrapy Python

I have been trying to write a script to retrieve my accepted solutions on SPOJ.
I got stuck automating the login process. I found Scrapy difficult to understand. After going through the docs and the code samples many times I have a vague picture of what happens behind the scenes, and this is where I stand now:
(I have commented the code at required places)
import os
import os.path
import scrapy
import urllib.request
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from bs4 import BeautifulSoup
class LoginSpider(scrapy.Spider):
    name = 'spoj'
    start_urls = ['http://www.spoj.com/login']
    outputFile = open('output.txt', 'w')
    def parse(self, response):
        username = input('Enter username\n')
        password = input('Enter password\n')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': username, 'password': password},
            callback=self.after_login
        )
    def after_login(self, response):
        # Even if I type in the correct username and password it always leads to
        # authentication failure and the following if statement evaluates to true.
        if str.encode('Authentication failed!') in response.body:
            self.logger.error("Login failed")
            print('Incorrect credentials')
            return
        print('lol')  # of course this isn't printed
        return scrapy.Request(url="http://www.spoj.com/myaccount/", callback=self.retrieve_codes)
    # needless to say, the following function is never called
    def retrieve_codes(self, response):
        print('Hello testing!')
        url = 'http://www.spoj.com/files/src/16731976/'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        self.outputFile.write(str(soup.prettify()))
In the docs it was if "authentication failed" in response.body: which I changed to
if str.encode('Authentication failed!') in response.body: for two reasons:
1) I was getting the error "a bytes-like object is required, not 'str'".
2) On SPOJ, entering wrong credentials displays "Authentication failed!" and not "authentication failed". We need to be precise here.
Please tell me where I'm going wrong. I haven't found any good resources on the net that discuss form submission in detail. After seeing this code in the docs, my initial questions were:
Is this the only way to do it?
Will this method work for every website? Because I learnt that complexity of this process varies from site to site.
Can I find a more descriptive explanation of what is happening behind the scenes?
I have also tried using robobrowser but in vain.
I was kind of expecting a good documentation like that of beautiful soup.
Thanks!
You are using the wrong formdata field names. You need to adjust the example code from the scrapy docs to the specific website. Currently you use username and password as formdata fields.
If you use the developer tools of your browser while logging in you can see that the fields that are sent by POST are labeled login_user and password.
So this should be easy to fix :-)

python selenium: possible to cancel redirect on driver.get()?

Is there a way to stop a url from redirecting?
driver.get('http://loginrequired.com')
This redirects me to another page but I want it to stay on that page without redirecting by default.
There are two ways that what users call "redirection" typically happens:
1) You load a page and the page loads some JavaScript code which performs a test and decides to load a different page. This process can be interrupted in some browsers by hitting the ESCAPE key. Selenium can send an ESCAPE key.
However, this redirection could happen before Selenium gives control back to your script. Whether it would work in any specific case depends on the page being loaded.
2) You load a page and get an HTTP 3xx redirect response (301, 302, 303, 307, etc.) from the server. Users have no opportunity to interrupt these redirections in their browser, so Selenium does not provide a means to interrupt or prevent them.
So there is no surefire way to prevent a redirection in Selenium.
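Outside Selenium, a server-side redirect can at least be observed without being followed. A minimal sketch using only the standard library's urllib rather than Selenium; the class and function names are made up for illustration:

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request,
        # so the 3xx response surfaces as an HTTPError instead.
        return None


def fetch_without_redirect(url):
    """Return (status, Location header) for url without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        # Redirects (and other HTTP error statuses) end up here.
        return err.code, err.headers.get("Location")


# Usage, assuming the URL from the question answers with a redirect:
#   status, location = fetch_without_redirect("http://loginrequired.com")
```

This only reveals the status code and target of an HTTP-level redirect; a redirect performed by JavaScript in the page would not be visible this way.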
One solution, in case you do not need to render the page but only need access to the source of "http://loginrequired.com", is to use Selenium together with Scrapy.
Basically, you tell Scrapy's redirect middleware to stop following redirects, so that when the spider accesses the page, the 302 response is handed to your callback instead of being followed.
In settings.py you have to set
REDIRECT_ENABLED = False
The spider code is:
from scrapy import Request
from scrapy.spiders import CrawlSpider
from selenium import webdriver
class LoginSpider(CrawlSpider):
    name = "login"
    allowed_domains = ['loginrequired.com']
    start_urls = ['http://loginrequired.com']
    # Let 302 responses reach the callback instead of being discarded
    handle_httpstatus_list = [302]
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="http://loginrequired.com", callback=self.after_302)
    def after_302(self, response):
        print(response.url)
        # Your code to analyze the page goes here
Idea taken from how to handle 302 redirect in scrapy
