Web-scraping a password protected website using Ghost.py - python

I'm trying to get the HTML content of a password protected site using Ghost.py.
The web server I have to access serves the following HTML (trimmed to the important parts):
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<script language="JavaScript">
function DoHash()
{
var psw = document.getElementById('psw_id');
var hpsw = document.getElementById('hpsw_id');
var nonce = hpsw.value;
hpsw.value = MD5(nonce.concat(psw.value));
psw.value = '';
return true;
}
</script>
</head>
<body>
<form action="PAGE.HTM" name="" method="post" onsubmit="DoHash();">
Access code <input id="psw_id" type="password" maxlength="15" size="20" name="q" value="">
<br>
<input type="submit" value="" name="q" class="w_bok">
<br>
<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">
</form>
</body>
</html>
The value of "#hpsw_id" changes every time you load the page.
On a normal browser, once you type the correct password and press enter or click the "submit" button, you land on the same page but now with the real contents.
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<!-- javascript is gone -->
</head>
<body>
Welcome to PAGE.htm content
</body>
</html>
First I tried mechanize, but that failed because the page needs JavaScript. So now I'm trying to solve it with Ghost.py.
My code so far:
import ghost
g = ghost.Ghost()
with g.start(wait_timeout=20) as session:
    page, extra_resources = session.open("http://192.168.1.60/PAGE.htm")
    if page.http_status == 200:
        print("Good!")
    session.evaluate("document.getElementById('psw_id').value='MySecretPassword';")
    session.evaluate("document.getElementsByClassName('w_bok')[0].click();", expect_loading=True)
    print(session.content)
This code does not load the contents correctly. In the console I get:
Traceback (most recent call last):
  File "", line 8, in
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 181, in wrapper
    timeout=kwargs.pop('timeout', None))
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1196, in wait_for_page_loaded
    'Unable to load requested page', timeout)
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1174, in wait_for
    raise TimeoutError(timeout_message)
ghost.ghost.TimeoutError: Unable to load requested page
Two questions:
1) How can I successfully log in to the password-protected site and get the real content of PAGE.htm?
2) Is this the best way to go, or am I missing something that would make things work more efficiently?
I'm using Ubuntu Mate.

This is not the answer I was looking for, just a workaround that makes it work (in case someone else has a similar issue in the future).
To skip the JavaScript part (which was stopping me from using Python's requests), I decided to compute the expected hash in Python instead of in the browser and send it the way the normal web form would.
The JavaScript basically concatenates the hidden hpsw_id value and the password, and computes an MD5 hash of the result.
The Python code now looks like this:
import requests
from hashlib import md5
from re import search

url = "http://192.168.1.60/PAGE.htm"
with requests.Session() as s:
    # Get hpsw_id number from website
    r = s.get(url)
    hpsw_id = search(r'name="pA" value="([A-Z0-9]*)"', r.text)
    hpsw_id = hpsw_id.group(1)
    # Make hash of ID and password (md5 needs bytes, so encode first)
    m = md5()
    m.update((hpsw_id + 'MySecretPassword').encode('utf-8'))
    pA = m.hexdigest()
    # Post to website to login
    r = s.post(url, data=[('q', ''), ('q', ''), ('pA', pA)])
    print(r.content)
Note: q, q and pA are the fields that the form sends (q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa) when I log in normally using a web browser.
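The hashing step can be isolated and checked without touching the device. Below is a minimal Python 3 sketch, assuming the page's MD5() returns a lowercase hex digest of the nonce followed by the password (which matches the format of the pA value seen in the browser). extract_nonce and login_hash are hypothetical helper names, not part of any library:

```python
import re
from hashlib import md5


def extract_nonce(page_html):
    """Return the value of the hidden pA input, or None if it is absent."""
    match = re.search(r'name="pA" value="([A-Z0-9]*)"', page_html)
    return match.group(1) if match else None


def login_hash(nonce, password):
    """Mimic DoHash(): MD5 over the nonce concatenated with the password."""
    return md5((nonce + password).encode("utf-8")).hexdigest()


# The hidden input as it appears in the question's login page
sample = '<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">'
nonce = extract_nonce(sample)
print(nonce)
print(login_hash(nonce, "MySecretPassword"))
```

Returning None for a missing nonce makes it easy to detect that the page served was not the login form (for example, because the session is already authenticated).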
However, if someone knows the answer to my original question, I would appreciate it if you posted it here.

Related

Python & request : send post with id?

I would like to submit a form on a webpage.
However, the page has several forms:
<form method="post" action="https://mywebsite.com/pageA" id="order" class="order ajaxForm">
<input type="text" class="decimal" name="value" id="fieldA" value="0" />
</label>
</form>
<form method="post" action="https://mywebsite.com/pageB" id="previousorder" class="order ajaxForm">
<input type="text" class="decimal" name="value" id="fieldB" value="0" />
</label>
</form>
Is there an easy way to submit a specific form using Python and requests?
I'd go with a more advanced tool like mechanize or MechanicalSoup. The latter is actually based on requests internally (I assume you meant the requests package by "request"). Both of these tools let you select the desired form and then submit it with the required parameters.
For instance, submitting the order form with MechanicalSoup would look something like this:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://yourwebsite.com")
# Fill-in the order form
browser.select_form('#order')
browser["value"] = "100"
browser.submit_selected()
You have to look at the DevTools Network tab while posting a form.
Every form will have a different request URL and different POST parameters. Generally, what you need to do with requests is something like this (the POST key is the input's name attribute, which is value for both forms here):
import requests
req = requests.post('https://mywebsite.com/pageB',
                    data={'value': 'value_you_want_to_submit'})
But it is better to investigate it with DevTools first.
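When DevTools is not at hand, the same information can be read from the page source. Below is a minimal Python 3 sketch using only the standard library. FormFinder and find_form are hypothetical names, and SAMPLE is a trimmed copy of the forms from the question:

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect the action URL and named inputs of the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.action = None
        self.fields = {}
        self._inside = False  # True while parsing the wanted form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._inside = attrs.get("id") == self.form_id
            if self._inside:
                self.action = attrs.get("action")
        elif tag == "input" and self._inside and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def find_form(page_html, form_id):
    """Return (action, fields) for the form with the given id."""
    finder = FormFinder(form_id)
    finder.feed(page_html)
    return finder.action, finder.fields


SAMPLE = """
<form method="post" action="https://mywebsite.com/pageA" id="order" class="order ajaxForm">
<input type="text" class="decimal" name="value" id="fieldA" value="0" />
</form>
<form method="post" action="https://mywebsite.com/pageB" id="previousorder" class="order ajaxForm">
<input type="text" class="decimal" name="value" id="fieldB" value="0" />
</form>
"""

action, fields = find_form(SAMPLE, "previousorder")
print(action, fields)
```

The returned action and fields can then be edited and passed to requests.post as the URL and the data argument.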
Try something like this (you will probably need to make some modifications, but it should be close to what you want; this example is for a login form).
Install lxml first, then:
import requests
from lxml import html

sessionReq = requests.session()
login_url = "https://example.be/account/login.php"

# Load the login page and read the CSRF token from its hidden input
result = sessionReq.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": authenticity_token
}
# Post the credentials together with the token, sending the login URL as referer
result = sessionReq.post(login_url, data=payload, headers=dict(referer=login_url))
# Pages behind the login can then be requested with the same session
url = 'https://bitbucket.org/dashboard/overview'
I hope this helps you :)

Uploading a file to Python bottle server without HTML form

Currently I have an HTML form which chooses the file and uploads it to the server.
How can I do this without an HTML form?
<html>
<head></head>
<body>
<form action="/upload" method="post" enctype="multipart/form-data">
Select a file: <input type="file" name="uploadinc" />
<input type="submit" value="Start upload" />
</form>
</body>
</html>
And my bottle server contains the following code to handle the upload.
@route('/UploadFiles', method='POST')
def UploadFiles():
    print("inside upload files")
    uploadinc = request.files.get('uploadinc')
    uploadinc.save("/home/user/files/"+uploadinc.filename)
I want to save the file directly, without the HTML UI, with something like:
request.files.get("file location in local machine if it is fixed(C:\\a.txt)")
But this returns None. How can I do it?
I am able to call the REST API from a REST client like this.
How can I make this call programmatically?
You could try the Requests library; see "POST a Multipart-Encoded File" in its documentation.
How to do this call programmatically?
A modified example from the Requests documentation:
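import requests

url = 'http://10.208.53.89:7778/UploadFiles'
# Each entry is (form field name, (filename, file object, content type)).
# The field name must match what the server reads ('uploadinc' in the question's handler).
multiple_files = [
    ('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
    ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))
]
r = requests.post(url, files=multiple_files)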
You want to upload files from the command line, instead of in a browser? Just use curl:
curl -F "image=@foo.png" -F "image2=@bar.png" http://localhost:8888/uploadFiles

How to call a postback in ASP.Net with Python

I am trying to web-scrape some elements and their values off a page with Python; however, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python only prints the same values over and over again (meaning the postback for the next button isn't being called). I am using requests for my POST/GET calls.
import re
import time
import requests

TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')

page = SESSION.get(GROUP_URL, headers = REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids,names) in re.findall(TITLE_REGEX, page):
            print ids,names
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page=SESSION.post(GROUP_URL, data = postData, stream = True).text
        time.sleep(2)
How can I properly call the post back in ASP.NET from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML Element of the button
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the JS __doPostBack method; however, I would first like to see whether this can be achieved in pure Python.
Yes, it should be achievable; you just have to submit the correct values in the correct fields. But I assume the page you are trying to parse uses ASP.NET Web Forms, so finding those values could be really time-consuming. I suggest looking into Selenium; with it you can easily trigger clicks and events on a web page without writing much code.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element(By.ID, "button").click()
# then get the data you want
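For the pure-Python route, it may help to know what the link's JavaScript does: in ASP.NET Web Forms, __doPostBack(target, argument) copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and then submits the form, and the postData in the question contains neither. Below is a minimal Python 3 sketch of building that payload from the page source. build_postback is a hypothetical helper, the SAMPLE values are made up, and it assumes the quotes in the href are literal rather than HTML-escaped:

```python
import re

POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")
HIDDEN_RE = re.compile(r'id="(__[A-Z]+)" value="([^"]*)"')


def build_postback(page_html, link_class):
    """Build the form data that clicking the postback link would submit."""
    link = re.search(
        r'<a class="%s" href="([^"]*)"' % re.escape(link_class), page_html
    )
    if link is None:
        return None  # no such link, e.g. on the last page
    call = POSTBACK_RE.search(link.group(1))
    if call is None:
        return None  # the link is not a __doPostBack link
    # Hidden state fields: __VIEWSTATE, __EVENTVALIDATION, ...
    data = dict(HIDDEN_RE.findall(page_html))
    data["__EVENTTARGET"], data["__EVENTARGUMENT"] = call.groups()
    return data


SAMPLE = """
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
"""

data = build_postback(SAMPLE, "pagerbtns next")
print(data)
```

The resulting dictionary could be merged with any other form fields and sent with SESSION.post(GROUP_URL, data=data); whether the server accepts it still depends on the rest of the form state.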

Unable to download website's contents using mechanize

I got a python script from here to download web contents from a course website:
from mechanize import Browser

b = Browser()
b.open("https://wiki.engr.illinois.edu/display/cs498cc/Home")
b.select_form(nr=0)
b["user"] = "myusername"
b["passwrd"] = "blabla"
b.submit()
response = b.response().read()
if "Salve <b>johnconnor</b>" in response:
    print("Logged in!")
I'm getting an error:
mechanize._form.ControlNotFoundError: no control matching name 'user'
I'm not sure how to do this since I've just started learning Python and only just discovered that library.
I've tried using the --user=X --password=Y flags with wget too, but it only downloads the login page!
The form elements have different names:
<input type="text" name="os_username" id="os_username" class="text " data-focus="0">
<input type="password" name="os_password" id="os_password" class="password ">
Change user to os_username and passwrd to os_password and it might work.

How can I simulate a web page form submit with a Python code-behind?

I have a critical issue. I would like to integrate my application with another, much older application. This service is simply a web form, probably behind a framework (maybe ASP Classic, I think). I have an action URL, and I have the HTML code for replicating this service.
This is a piece of the old service (the HTML page):
<FORM method="POST"
url="https://host/path1/path2/AdapterHTTP?action_name=myactionWebAction&NEW_SESSION=true"
enctype="multipart/form-data">
<INPUT type="text" name="AAAWebView-FormAAA-field1" />
<INPUT type="hidden" name="AAAWebView-FormAAA-field2" value="" />
<INPUT type="submit" name="NAV__BUTTON__press__AAAWebView-FormAAA-enter" value="enter" />
</FORM>
My application should simulate this old application's form submission from code-behind with Python. So far, I haven't had much luck.
This is what I do for now:
import requests

# field1Value and field2Value hold the values to submit
payload = {
    'AAAWebView-FormAAA-field1': field1Value,
    'AAAWebView-FormAAA-field2': field2Value,
    'NAV__BUTTON__press__AAAWebView-FormAAA-enter': "enter"
}
url = "https://host/path1/path2/AdapterHTTP?action_name=myactionWebAction&NEW_SESSION=true"
headers = {'content-type': 'multipart/form-data'}
r = requests.post(url, data=payload, headers=headers)
print(r.status_code)
I receive a 200 HTTP response code. When I click the submit button on the HTML page, the action saves the values, but my code does not have the same effect. How do I fix this problem?
The owner of the old application sent me this Java exception log. Any ideas?
org.apache.commons.fileupload.FileUploadException: the request was rejected because no multipart boundary was found
Try passing an empty dictionary as files to requests.post. I think this will construct a request with a proper multipart boundary.
r = requests.post(url, data=payload, headers=headers, files={})
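If the server still reports a missing boundary, it can help to see what a well-formed multipart request looks like. The Content-Type header has to carry the same boundary string that separates the parts in the body; a hand-written 'multipart/form-data' header has no boundary, which matches the Java exception above. Below is a minimal Python 3 sketch with the standard library only. encode_multipart is a hypothetical helper that handles plain text fields (no file parts), and "some value" is a placeholder:

```python
import uuid


def encode_multipart(fields):
    """Encode text fields as multipart/form-data; return (content_type, body)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)  # opening delimiter of each part
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")  # blank line between part headers and value
        lines.append(value)
    lines.append("--" + boundary + "--")  # closing delimiter
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body


content_type, body = encode_multipart({
    "AAAWebView-FormAAA-field1": "some value",
    "AAAWebView-FormAAA-field2": "",
    "NAV__BUTTON__press__AAAWebView-FormAAA-enter": "enter",
})
print(content_type)
print(body.decode("utf-8"))
```

The pair could then be sent with requests.post(url, data=body, headers={"Content-Type": content_type}), so that the header and the body agree on the boundary.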
