Scrape a website that uses JavaScript with Python

I am attempting to scrape a website using the following code
import re
import requests

def get_csrf(page):
    matchme = r'name="csrfToken" value="(.*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'
    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)
        username = 'USER'
        password = 'PASS'
        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK': 'authenticationEntryComponent',
                 'submitEvent': '1',
                 'enterClicked': 'true',
                 'ajaxSupported': 'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()
Logging in to https://www.edline.net/InterstitialLogin.page succeeds, but the problem appears when I then run
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
Instead of the expected page, I get an error. Further testing showed that the same error appears even if you open the page directly in a browser. When I looked at the page source, I found that the button linking to the page I'm trying to scrape uses the following code:
Private Reports
So essentially I am looking for a way to trigger the above JavaScript code from Python in order to scrape the resulting page.

It is impossible to answer this question properly without more context than this single link.
However, with JavaScript-driven content generation, the first thing to check is which requests your web page makes when you click on that link.
To do this, open the network panel in your browser's developer tools. Record the requests being made and look especially for XHR requests. You can then try to replicate them, e.g. with the requests library:
content = requests.get('xhr-url')
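As a rough sketch of that last step: once the network panel shows which request the button triggers, the same request can be sent through the session that is already logged in. The function name `fetch_xhr`, the `X-Requested-With` header, and the URLs in the usage comment are assumptions for illustration; copy the real URL, method, and headers from what your browser recorded.

```python
def fetch_xhr(session, xhr_url, referer):
    """Replay a GET request seen in the browser's network panel.

    `session` is expected to be a logged-in requests.Session, so the
    cookies set during login are sent along automatically.
    """
    headers = {
        # Many sites mark AJAX calls with this header (assumption: use
        # whatever headers your browser actually sent).
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    response = session.get(xhr_url, headers=headers)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Usage (placeholder URLs, to be replaced with the recorded ones):
#   import requests
#   with requests.Session() as s:
#       ...log in first...
#       html = fetch_xhr(s, "https://example.com/recorded-xhr-url",
#                        "https://example.com/page-with-the-button")
```

If the recorded request is a POST, the same idea applies with `session.post` and the recorded form data.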

Related

Logging into Google account with Python requests

Several login pages require a Google login in order to proceed. I would like to use the requests library in Python to log myself in. Normally this would be easy with requests, but I have not been able to get it to work. I am not sure whether this is due to some restriction Google has put in place (perhaps I need to use their API?), or because the Google login page requires the user to enter their email first, press submit, and only then enter their password.
This question has been asked before, but none of the solutions work for me. Currently I am using the code from this solution, Log into Google account using Python?, as shown here:
from bs4 import BeautifulSoup
import requests

my_email = "email_placeholder#gmail.com" # my email is here
my_pass = "my_password" # my password is here
form_data = {'Email': my_email, 'Passwd': my_pass}
post = "https://accounts.google.com/signin/challenge/sl/password"

with requests.Session() as s:
    soup = BeautifulSoup(s.get("https://mail.google.com").text, "html.parser")
    for inp in soup.select("#gaia_loginform input[name]"):
        if inp["name"] not in form_data:
            form_data[inp["name"]] = inp["value"]
    s.post(post, form_data)
    html = s.get("https://mail.google.com/mail/u/0/#inbox").content
    print(my_email in s.get('https://mail.google.com/mail/u/0/#inbox').text) # Prints 'False', should print 'True'
As you can see, the last line prints False. Furthermore, if I write the HTML to a file and open it in a browser, I get the default Google login page, which indicates that the login has not worked.

Not able to get another page when I am using the Python requests session module to log in

I am trying to log in to LinkedIn using a Python requests session, but I am not able to access other pages. Please help me out.
My code is like this:
import requests
from bs4 import BeautifulSoup

# Get login form
URL = 'https://www.linkedin.com/uas/login'
session = requests.session()
login_response = session.get(URL)
login = BeautifulSoup(login_response.text, "lxml")

# Get hidden form inputs
inputs = login.find('form', {'name': 'login'}).findAll(
    'input', {'type': ['hidden', 'submit']})

# Create POST data
post = {input.get('name'): input.get('value') for input in inputs}
post['session_key'] = 'username'
post['session_password'] = 'password'

# Post login
post_response = session.post('https://www.linkedin.com/uas/login-submit', data=post)
notify_response = session.get('https://www.linkedin.com/company-beta/3067/')
notify = BeautifulSoup(notify_response.text, "lxml")
print(notify.title)
I hope I'm not getting this wrong, but I had to crawl LinkedIn a few weeks ago and saw that LinkedIn is pretty good at spotting bots. I'm almost sure that is your issue here (try printing the output of post_response; you will most likely see that you are on a captcha page or something similar).
Plot twist: I succeeded in logging in to LinkedIn by running Selenium, logging in by hand, and using pickle to save the cookies to a file.
Then, instead of using the login form, I just loaded the cookies into Selenium and refreshed the page and, ta-da, I was logged in. I think this can be done with requests as well.
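With requests, the cookie approach could look roughly like this. It is only a sketch: it assumes the cookies were pickled as the list of dicts that Selenium's `driver.get_cookies()` returns, and the helper names and the file name in the usage comment are made up for illustration.

```python
import pickle


def save_cookies(cookies, path):
    """Pickle a list of cookie dicts, e.g. the result of driver.get_cookies()."""
    with open(path, "wb") as fh:
        pickle.dump(cookies, fh)


def load_cookies(session, path):
    """Copy pickled cookie dicts into the cookie jar of a requests.Session."""
    with open(path, "rb") as fh:
        cookies = pickle.load(fh)
    for cookie in cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            # Fall back to neutral defaults when a field is missing.
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return len(cookies)


# Usage (hypothetical file name):
#   save_cookies(driver.get_cookies(), "cookies.pkl")  # after logging in by hand
#   session = requests.Session()
#   load_cookies(session, "cookies.pkl")
#   session.get("https://www.linkedin.com/company-beta/3067/")
```

Only unpickle files you created yourself, since pickle can execute code while loading.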

Login to jsp website using Requests

I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME':'myusername',
'PASSWORD':'mypassword'
}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But the print page.text is showing that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I am not sure how to log in to a JavaScript page. Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
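To avoid hard-coding the target, the form's `action` can be read from the login page and resolved against the page URL. The sketch below uses only the Python 3 standard library; the class and function names are illustrative, and it simply takes the first `<form>` that has an `action` attribute, which is an assumption about the page layout.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Remember the action of the first <form> that has an action attribute."""

    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.action is None:
            self.action = dict(attrs).get("action")


def form_post_url(page_url, html):
    """Return the absolute URL that the login form submits to."""
    parser = FormActionParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> with an action found")
    # Relative actions are resolved against the URL of the page itself.
    return urljoin(page_url, parser.action)


login_page = "http://www.whispernumber.com/signIn.jsp?source=calendar.jsp"
form_html = '<form action="ValidatePassword.jsp" method="post">'
print(form_post_url(login_page, form_html))
# http://www.whispernumber.com/ValidatePassword.jsp
```

The returned URL is the one to post the credentials to, ideally through a single `requests.Session()` used for every request so that the cookie is kept between them.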
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page and does not run any JavaScript. You are going to want to use something like Selenium or a headless browser to programmatically log in to a site.

Python SSL post using requests

The goal here is to be able to post username and password information to https://canvas.instructure.com/login so I can access and scrape information from a page once logged in.
I know the login information and the names of the login and password fields (pseudonym_session[user_id] and pseudonym_session[password]), but I'm not sure how to use requests.Session() to get past the login page.
import requests
s = requests.Session()
payload = {'pseudonym_session[user_id]': 'bond', 'pseudonym_session[password]': 'james bond'}
r = s.post('https://canvas.instructure.com/login', data=payload)
r = s.get('https://canvas.instructure.com/(The page I want)')
print(r.content)
Thanks for your time!
Actually the code posted works fine. I had a spelling error on my end with the password. Now I'm just using beautiful soup to find what I need on the page after logging in.
Put Chrome (or your browser of choice) into debug mode (Tools -> Developer Tools -> Network in Chrome) and do a manual login. Then follow closely what happens and replicate it in your code. I believe that is the only way, unless the website has a documented API.

Using Python to login webpage (cookies GET and POST involved)

I am trying to use Python 2.7.6 to log in to a website. The login logic consists of 2 steps on 2 web pages:
1. Putting the user ID and password into page A, and page A gives back a cookie;
2. This cookie is used in the header to authenticate the login on page B.
The login only succeeds once B has authenticated it.
There’s a post here, HTTP POST and GET with cookies for authentication in python, asking a similar question. One solution is to use requests.
import requests
url_0 = "http://www.PAGE_A.com/" # http://webapp.pucrs.br/consulta/principal.jsp, in original example
url = "http://www.PAGE_B.com/" # https://webapp.pucrs.br/consulta/servlet/consulta.aluno.ValidaAluno, in original example
data = {"field1_username": "ABC", "field_password": "123"}
s = requests.session()
s.get(url_0)
r = s.post(url, data)
I tried this in Python for my case and it doesn't return an error message, so I guess it works fine.
But the question is, how do I know it’s logged in?
I added the code below to print the logged-in page and see whether it returned the right page.
import mechanize
br = mechanize.Browser()
open_page = br.open("http://www.PAGE_B.com/")
read_page = open_page.read()
print read_page
However, it still shows the contents from before login. What went wrong?
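How about just going with one of the two? The mechanize browser is a separate client and does not share the cookies held by the requests session, so it is not logged in.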
import mechanize
browser = mechanize.Browser()
browser.addheaders = [('...')]
browser.open(YOUR_URL)
browser.select_form(FORM_NAME)
browser.form['USERNAME_FIELD'] = 'abc'
browser.form['PASSWORD_FIELD'] = 'password'
browser.submit()
print browser.response().read()
print browser.geturl()
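On the "how do I know it's logged in?" part, one rough heuristic when staying with requests is to inspect the response that comes back from the login POST, within the same session. This is a sketch rather than a universal test: the helper name, the `login_path` fragment, and the `marker` text are assumptions that have to be adapted to the site.

```python
def looks_logged_in(response, login_path, marker):
    """Heuristic check on a requests-style response after logging in.

    Returns True when the request succeeded, we were not sent back to
    the login page, and some text that only logged-in users see (for
    example a "Logout" link) is present in the body.
    """
    if response.status_code != 200:
        return False
    if login_path in response.url:
        return False
    return marker in response.text


# Usage with the session from the question (hypothetical path and marker):
#   s = requests.session()
#   s.get(url_0)
#   r = s.post(url, data)
#   print(looks_logged_in(r, "login", "Logout"))
```

Any later page should be fetched with the same `s` object, since that is where the login cookie lives.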
