I got a Python script from here to download web content from a course website:
from mechanize import Browser
b = Browser()
b.open("https://wiki.engr.illinois.edu/display/cs498cc/Home")
b.select_form(nr=0)
b["user"] = "myusername"
b["passwrd"] = "blabla"
b.submit()
response = b.response().read()
if "Salve <b>johnconnor</b>" in response:
    print "Logged in!"
I'm getting an error:
mechanize._form.ControlNotFoundError: no control matching name 'user'
I'm not sure how to do this since I've just started learning python and discovered that library.
I've tried using the --user=X --password=Y flags with wget too, but it only downloads the login page!
The form elements have different names:
<input type="text" name="os_username" id="os_username" class="text " data-focus="0">
<input type="password" name="os_password" id="os_password" class="password ">
Change user to os_username and passwrd to os_password and it might work.
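More generally, when mechanize raises ControlNotFoundError, it helps to dump the actual control names from the page source instead of guessing. A minimal stdlib sketch (Python 3 syntax shown; on Python 2.7 the module is HTMLParser, and `input_names` is just a helper name I made up):

```python
from html.parser import HTMLParser

class InputLister(HTMLParser):
    """Collects the name attribute of every <input> tag on a page."""
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                self.names.append(attrs["name"])

def input_names(html):
    parser = InputLister()
    parser.feed(html)
    return parser.names

print(input_names('<input type="text" name="os_username">'
                  '<input type="password" name="os_password">'))
# ['os_username', 'os_password']
```

With mechanize itself you should also be able to inspect the selected form directly, e.g. `[c.name for c in b.form.controls]` after `b.select_form(nr=0)`.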
I want to log in to the site below using the requests module in Python.
https://accounts.dmm.com/service/login/password
But I cannot find the "login_id" and "password" fields in the requests response.
I CAN find them using the "Inspect" menu in Chrome.
<input type="text" name="login_id" id="login_id" placeholder="メールアドレス" value="">
and
<input type="password" name="password" id="password" placeholder="パスワード" value="">
I tried to find them in the response from requests, but couldn't.
Here is my code:
import requests
url = 'https://accounts.dmm.com/service/login/password'
session = requests.session()
response = session.get(url)
with open('test_saved_login.html', 'w', encoding="utf-8") as file:
    file.write(response.text)  # Neither "login_id" nor "password" field found in the file.
How should I do this?
Selenium is an easy solution, but I do not want to use it.
The login form is created with JavaScript. Try viewing the page in a browser with JavaScript disabled: there will be no form. The people who control that site are trying to prevent people from doing exactly what you're trying to do. In addition to the fact that the form elements don't appear (which really doesn't matter with requests), they are also using a special token that you won't be able to guess, which I expect is also generated by obfuscated JavaScript. So it is likely impracticable to script a login with requests, and unless you have special permission from this company it is highly inadvisable that you continue with what you're trying to do.
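You can verify this yourself with a plain string check on the raw markup, since requests only ever sees the server-rendered HTML, never what JavaScript builds afterwards. A minimal sketch (`has_field` is a helper name I made up):

```python
def has_field(html, field_name):
    # True only if an <input> with this name= exists in the raw markup,
    # i.e. before any JavaScript has run.
    return 'name="%s"' % field_name in html

# Hypothetical usage with the saved page from the question's code:
# html = open('test_saved_login.html', encoding='utf-8').read()
# print(has_field(html, 'login_id'))  # False: the form is injected client-side
print(has_field('<input type="text" name="login_id">', 'login_id'))    # True
print(has_field('<html><body>no form here</body></html>', 'login_id'))  # False
```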
I'm trying to get the HTML content of a password protected site using Ghost.py.
The web server I have to access, has the following HTML code (I cut it just to the important parts):
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<script language="JavaScript">
function DoHash()
{
var psw = document.getElementById('psw_id');
var hpsw = document.getElementById('hpsw_id');
var nonce = hpsw.value;
hpsw.value = MD5(nonce.concat(psw.value));
psw.value = '';
return true;
}
</script>
</head>
<body>
<form action="PAGE.HTM" name="" method="post" onsubmit="DoHash();">
Access code <input id="psw_id" type="password" maxlength="15" size="20" name="q" value="">
<br>
<input type="submit" value="" name="q" class="w_bok">
<br>
<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">
</form>
</body>
</html>
The value of "#hpsw_id" changes every time you load the page.
On a normal browser, once you type the correct password and press enter or click the "submit" button, you land on the same page but now with the real contents.
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<!-- javascript is gone -->
</head>
<body>
Welcome to PAGE.htm content
</body>
</html>
First I tried with mechanize but failed, as I need JavaScript. So now I'm trying to solve it using Ghost.py.
My code so far:
import ghost

g = ghost.Ghost()
with g.start(wait_timeout=20) as session:
    page, extra_resources = session.open("http://192.168.1.60/PAGE.htm")
    if page.http_status == 200:
        print("Good!")
    session.evaluate("document.getElementById('psw_id').value='MySecretPassword';")
    session.evaluate("document.getElementsByClassName('w_bok')[0].click();", expect_loading=True)
    print(session.content)
This code is not loading the contents correctly, in the console I get:
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 181, in wrapper
    timeout=kwargs.pop('timeout', None))
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1196, in wait_for_page_loaded
    'Unable to load requested page', timeout)
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1174, in wait_for
    raise TimeoutError(timeout_message)
ghost.ghost.TimeoutError: Unable to load requested page
Two questions...
1) How can I successfully login to the password protected site and get the real content of PAGE.htm?
2) Is this the best way to go, or am I missing something completely that would make things work more efficiently?
I'm using Ubuntu Mate.
This is not the answer I was looking for, just a work-around to make it work (in case someone else has a similar issue in the future).
To skip the JavaScript part (which was stopping me from using Python's requests), I decided to compute the expected hash in Python (rather than in the browser) and send the hash just as the normal web form would.
So the JavaScript basically concatenates the hidden hpsw_id value and the password, and makes an MD5 of it.
The Python now looks like this:
import requests
from hashlib import md5
from re import search

url = "http://192.168.1.60/PAGE.htm"

with requests.Session() as s:
    # Get hpsw_id number from website
    r = s.get(url)
    hpsw_id = search('name="pA" value="([A-Z0-9]*)"', r.text)
    hpsw_id = hpsw_id.group(1)

    # Make hash of ID and password
    m = md5()
    m.update(hpsw_id + 'MySecretPassword')
    pA = m.hexdigest()

    # Post to website to login
    r = s.post(url, data=[('q', ''), ('q', ''), ('pA', pA)])
    print(r.content)
Note: the q, q and pA are the elements that the form (q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa) sends when I log in normally using an internet browser.
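That payload shape matters: passing a list of tuples preserves the duplicate q key, which a dict would collapse to a single entry. A quick stdlib check (Python 3 shown; on Python 2 the function is urllib.urlencode):

```python
from urllib.parse import urlencode

as_list = [('q', ''), ('q', ''), ('pA', 'f08b97e5e3f472fdde4280a9aa408aaa')]
as_dict = {'q': '', 'pA': 'f08b97e5e3f472fdde4280a9aa408aaa'}

print(urlencode(as_list))  # q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa
print(urlencode(as_dict))  # only one q survives
```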
If someone does know the answer to my original question, however, I would really appreciate it if you posted it here.
I'm really new to python programming. I'm working on automation of a web-browser. I started with selenium, but found it to be really slow for what I need.
I'm working on code that can log in to a webpage, fill out a few text boxes and click a few buttons. I finally achieved the first part: my program can sign in using RoboBrowser.
import re
from robobrowser import RoboBrowser
browser = RoboBrowser()
login_url = 'https://webbroker.td.com/waw/idp/login.htm?execution=e1s1'
browser.open(login_url)
form = browser.get_form(id="login")
form["login:AccessCard"].value = "****"
form["login:Webpassword"].value = "****"
browser.submit_form(form)
As soon as I log in, the webpage asks me to answer an authentication question.
<div class="td-layout-row td-margin-top-medium">
<div class="td-layout-column td-layout-grid15"><label class="questionText" for="MFAChallengeForm:answer" id="MFAChallengeForm:question">
What is your favourite TV show?</label></div>
</div>
<div class="td-layout-row">
<div class="td-layout-column td-layout-grid7"><input autocomplete="off" id="MFAChallengeForm:answer" maxlength="25" name="MFAChallengeForm:answer" onkeydown="trapEnter(event,'MFAChallengeForm',id,'next')" size="25" type="password" value=""/></div>
</div>
How do I carry on from here? I need to enter and submit my authentication answer in order to proceed. In Selenium it would be something like this:
AQ = driver.find_element_by_id("MFAChallengeForm:answer")
AQ.send_keys("******")
*Click Submit*
However, how would I do it in robobrowser/lxml/beautifulsoup? I need to submit my answer (while still being logged in). Thank you in advance.
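RoboBrowser stays logged in as long as you keep using the same browser object, so the challenge page can be treated like any other form. As a sketch, you could first pull the question text out of the page (pure-stdlib helper below, written against the markup shown above), then fill the answer field the same way the login form was filled. The form id in the commented lines is an assumption; inspect the actual challenge page to confirm it.

```python
import re

def mfa_question(html):
    # Grab the text of the <label id="MFAChallengeForm:question"> element.
    match = re.search(r'id="MFAChallengeForm:question">\s*(.*?)</label>', html, re.S)
    return match.group(1).strip() if match else None

# Continuing after browser.submit_form(form) from the question
# (hypothetical form id, and robobrowser exposes the parsed page as browser.parsed):
# print(mfa_question(str(browser.parsed)))
# mfa = browser.get_form(id="MFAChallengeForm")
# mfa["MFAChallengeForm:answer"].value = "****"
# browser.submit_form(mfa)
```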
I am working in a Linux VNC. It has an older Python version (2.7), so I can't use any of the predefined GUI packages that are unavailable in this version and need an alternative. Also, with the code below, the browser is redirected to the HTML page, but at the same time the CGI script also runs and prints "None" in the terminal. I want the HTML part to happen first; only after I give the input in the web page and click the submit button should that value appear in the terminal. How can I do that? Also, there is no cgi-bin folder, so clicking the submit button gives a File Not Found error as well. Please help me through this...
test.py
import webbrowser

def func1():
    f = open('helloworld.html', 'w')
    message = """<html>
<head></head>
<body>
<form name="get" action="/cgi-bin/cgi_test.py" method="get">
Name : <input type="text" name="nm"/></br>
<input type="submit" value="Submit"/>
</form>
</body>
</html>"""
    f.write(message)
    f.close()
    webbrowser.open('helloworld.html')
cgi_test.py
import test
import cgi
test.func1()
form=cgi.FieldStorage()
n=form.getvalue('nm')
print (n)
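The File Not Found error is expected: webbrowser.open() just opens a local file, so there is no server to execute anything under /cgi-bin/. One fix that stays in the Python 2.7 stdlib is to serve the page through a CGI-capable HTTP server. The sketch below uses the Python 3 module names with the Python 2 equivalents noted, and assumes cgi_test.py has been moved into a cgi-bin/ folder next to helloworld.html (cgi_test.py should then only read the form, so the `import test` / `test.func1()` lines would be dropped from it).

```python
# On Python 2.7, the equivalents are BaseHTTPServer.HTTPServer and
# CGIHTTPServer.CGIHTTPRequestHandler.
from http.server import HTTPServer, CGIHTTPRequestHandler

# CGI scripts are only executed from these directories by default:
print(CGIHTTPRequestHandler.cgi_directories)  # ['/cgi-bin', '/htbin']

# To run it, uncomment the two lines below, browse to
# http://localhost:8000/helloworld.html and submit the form;
# cgi_test.py then prints the value of "nm" in its response.
# server = HTTPServer(("localhost", 8000), CGIHTTPRequestHandler)
# server.serve_forever()
```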
I am trying to web-scrape some elements and their values off a page with Python; however, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python just prints the same values over and over again [meaning the postback for the next button isn't being called]. I am using requests to do my POST/GET.
import re
import time
import requests
TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')
page = SESSION.get(GROUP_URL, headers=REQUEST_HEADERS).text

while 1:
    if POST_BUTTON_HTML in page:
        for (ids, names) in re.findall(TITLE_REGEX, page):
            print ids, names
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page = SESSION.post(GROUP_URL, data=postData, stream=True).text
    time.sleep(2)
How can I properly trigger the ASP.NET postback from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML Element of the button
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the JS __doPostBack method; however, I would first like to see if this can be achieved in pure Python.
Yes, it should be achievable: you just have to submit the correct values in the correct fields. But I assume the web page you are trying to parse uses ASP.NET Web Forms, so it could be really time-consuming to find those values. I suggest you look into Selenium; with it you can easily trigger clicks and events on a webpage without writing so much code.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element_by_id("button").click()
# then get the data you want
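If you would still rather stay with requests, note that the postData in the question never tells the server which control fired: Web Forms reads that from the __EVENTTARGET and __EVENTARGUMENT fields, which __doPostBack normally fills in on the client. A sketch of translating the button's href into those extra fields (`postback_data` is my own helper name, not an API):

```python
import re

def postback_data(href, hidden_fields):
    # Turn a javascript:__doPostBack('target','argument') link into the
    # fields an ASP.NET Web Forms server expects alongside __VIEWSTATE etc.
    match = re.search(r"__doPostBack\('([^']*)','([^']*)'\)", href)
    data = dict(hidden_fields)              # __VIEWSTATE, __EVENTVALIDATION, ...
    data["__EVENTTARGET"] = match.group(1)  # the control that "clicked"
    data["__EVENTARGUMENT"] = match.group(2)
    return data

# The href of the "next" button from the question:
href = ("javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane"
        "$dlUsers_Footer$ctl02$ctl00','')")
print(postback_data(href, {})["__EVENTTARGET"])
```

The resulting dict would be merged with the __VIEWSTATE / __EVENTVALIDATION values already being scraped, and POSTed as before.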