I am trying to scrape some elements and their values off a page with Python. However, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python just prints the same values over and over again (meaning the postback for the next button isn't being triggered). I am using requests for my POST/GET calls.
import re
import time
import requests

TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')

page = SESSION.get(GROUP_URL, headers = REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids,names) in re.findall(TITLE_REGEX, page):
            print ids,names
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page=SESSION.post(GROUP_URL, data = postData, stream = True).text
        time.sleep(2)
How can I properly call the post back in ASP.NET from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML element of the button:
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the JS __doPostBack method; however, I would first like to see if this can be achieved in pure Python.
Yes, it should be achievable; you just have to submit the correct values in the correct fields. But I assume the page you are trying to parse uses ASP.NET Web Forms, so finding those values can be really time-consuming. I suggest you look into Selenium; with it you can easily trigger clicks and events on a web page without writing so much code.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element(By.ID, "button").click()
# then get the data you want
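For the pure-Python route, __doPostBack(target, argument) on a Web Forms page normally just writes its two arguments into the hidden __EVENTTARGET and __EVENTARGUMENT fields and submits the form. Below is a minimal sketch under that assumption: it collects the page's hidden inputs and adds those two fields. The names HiddenInputParser and build_postback_data and the sample HTML are made up for illustration, and a page that uses partial (UpdatePanel) postbacks may expect extra fields that this sketch does not cover.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_data(page_html, event_target, event_argument=""):
    """Return the form data a __doPostBack(event_target, event_argument) call would submit."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    data = dict(parser.fields)  # __VIEWSTATE, __EVENTVALIDATION, ...
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return data


# Hypothetical, heavily shortened page used only to show the result.
sample_html = """
<form method="post">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
<input type="text" name="PageTextBox" value="1" />
</form>
"""
next_button = "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00"
post_data = build_postback_data(sample_html, next_button)
```

The resulting dictionary would then be sent with something like SESSION.post(GROUP_URL, data=post_data), re-parsing each response, since the hidden values usually change from page to page.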
Related
HTML Login button :
HTML User_id : <input name="userid" type="text" tabindex="1" class="login_input" value="" onfocus="check_userid_on()" onclick="check_userid_on()" onblur="check_userid_off()">
I attached the picture of user_id, user_pw, and sign-in button page source for better understanding.
https://i.stack.imgur.com/kr8Nf.png
https://i.stack.imgur.com/irZJz.png
In Python, I want to fill in user_id and user_pw and then log in using the login button, which calls a JavaScript function named loginsendit().
So far my code starts like this:
LOGIN_INFO = {
    'userId': 'myidid',
    'userPassword': 'mypassword123'
}
user_id = soup.find('input' , {'name': 'userid'})
user_id['value'] = LOGIN_INFO['userId']
user_pw = soup.find('input', {'name': 'userpw'})
user_pw['value'] = LOGIN_INFO['userPassword']
login_req = s.post('url', data='loginSendit()')
print(login_req.status_code)
But it only prints 200 even if the password or username is wrong, which means my code isn't actually logging me in.
Can you help me call this loginsendit() JavaScript function from Python?
If you want to execute JavaScript, look at PhantomJS (general tutorial) and the threads "Running javascript in Selenium using Python" and "Executing Javascript on Selenium/PhantomJS".
I'm trying to get the HTML content of a password-protected site using Ghost.py.
The web server I have to access has the following HTML code (I cut it down to the important parts):
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<script language="JavaScript">
function DoHash()
{
var psw = document.getElementById('psw_id');
var hpsw = document.getElementById('hpsw_id');
var nonce = hpsw.value;
hpsw.value = MD5(nonce.concat(psw.value));
psw.value = '';
return true;
}
</script>
</head>
<body>
<form action="PAGE.HTM" name="" method="post" onsubmit="DoHash();">
Access code <input id="psw_id" type="password" maxlength="15" size="20" name="q" value="">
<br>
<input type="submit" value="" name="q" class="w_bok">
<br>
<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">
</form>
</body>
</html>
The value of "#hpsw_id" changes every time you load the page.
On a normal browser, once you type the correct password and press enter or click the "submit" button, you land on the same page but now with the real contents.
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<!-- javascript is gone -->
</head>
<body>
Welcome to PAGE.htm content
</body>
</html>
First I tried mechanize but failed, as I need JavaScript. So now I'm trying to solve it using Ghost.py.
My code so far:
import ghost
g = ghost.Ghost()
with g.start(wait_timeout=20) as session:
    page, extra_resources = session.open("http://192.168.1.60/PAGE.htm")
    if page.http_status == 200:
        print("Good!")
        session.evaluate("document.getElementById('psw_id').value='MySecretPassword';")
        session.evaluate("document.getElementsByClassName('w_bok')[0].click();", expect_loading=True)
        print session.content
This code is not loading the contents correctly, in the console I get:
Traceback (most recent call last):
  File "", line 8, in
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 181, in wrapper
    timeout=kwargs.pop('timeout', None))
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1196, in wait_for_page_loaded
    'Unable to load requested page', timeout)
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1174, in wait_for
    raise TimeoutError(timeout_message)
ghost.ghost.TimeoutError: Unable to load requested page
Two questions...
1) How can I successfully login to the password protected site and get the real content of PAGE.htm?
2) Is this direction the best way to go? Or am I missing something completely that would make things work more efficiently?
I'm using Ubuntu Mate.
This is not the answer I was looking for, just a workaround to make it work (in case someone else has a similar issue in the future).
To skip the JavaScript part (which was stopping me from using Python's requests), I decided to compute the expected hash in Python (and not in the browser) and send the hash as the normal web form would do.
So the JavaScript basically concatenates the hidden hpsw_id value and the password, and computes an MD5 hash of the result.
The Python now looks like this:
import requests
from hashlib import md5
from re import search

url = "http://192.168.1.60/PAGE.htm"
with requests.Session() as s:
    # Get hpsw_id number from website
    r = s.get(url)
    hpsw_id = search('name="pA" value="([A-Z0-9]*)"', r.text)
    hpsw_id = hpsw_id.group(1)
    # Make hash of ID and password (md5 needs bytes, so encode first)
    m = md5()
    m.update((hpsw_id + 'MySecretPassword').encode('utf-8'))
    pA = m.hexdigest()
    # Post to website to login
    r = s.post(url, data=[('q', ''), ('q', ''), ('pA', pA)])
    print(r.content)
Note: q, q and pA are the fields the form sends (q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa) when I log in normally using a web browser.
However, if someone knows the answer to my original question, I would really appreciate it if you posted it here.
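The hashing step of this workaround can also be wrapped in a small helper that is easy to check without the device at hand. This is only a sketch: build_login_fields is a hypothetical name, the regular expression assumes the hidden field looks like the one in the question, and it assumes the page's MD5 function returns lowercase hex as hashlib does (which matches the pA value observed in the browser).

```python
import re
from hashlib import md5


def build_login_fields(page_html, password):
    """Mimic DoHash(): pA = MD5(nonce + password), with both q fields left empty."""
    match = re.search(r'name="pA" value="([A-Z0-9]*)"', page_html)
    if match is None:
        raise ValueError("nonce field 'pA' not found in the page")
    nonce = match.group(1)
    digest = md5((nonce + password).encode("utf-8")).hexdigest()
    return [("q", ""), ("q", ""), ("pA", digest)]


# Hidden field copied from the login page shown in the question.
login_page = '<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">'
fields = build_login_fields(login_page, "MySecretPassword")
```

The returned list can be passed straight to s.post(url, data=fields); the nonce has to be read from a fresh GET each time because it changes on every page load.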
I am scraping a page, using both Scrapy and Splash. The page contains a dropdown box (technically, a select HTML element). Each time an element is selected in the dropdown box, a new page is loaded using AJAX.
The HTML segment below is a simplified version of the page I'm processing:
<html>
<head><title>Title goes here ...</title></head>
<body>
<select class="foo">
<option value=100 data-reactid=1>One</option>
<option value=200 data-reactid=2>Two</option>
<!-- ... -->
<option value=900 data-reactid=9>Nine</option>
</select>
</body>
</html>
Snippet of my scrapy/splash code:
# Fetch the options ... now what ?
options = response.css("select[class=foo] option[data-reactid]")
How do I programmatically use Splash to 'click' and receive the reloaded AJAX page in my response object?
You might try Splash's execute endpoint with a Lua script that sets the select to each option's value and returns the result. Something like:
...
script = """
function main(splash)
    splash.resource_timeout = 10
    splash:go(splash.args.url)
    splash:wait(1)
    splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
    splash:wait(1)
    return {
        html = splash:html(),
    }
end
"""

# base_url refers to page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
    yield scrapy_splash.SplashRequest(
        base_url, self.parse_result, endpoint='execute',
        args={'lua_source': script, 'value': value, 'timeout': 3600})
Of course, this isn't tested, but you might start there and play with it.
Using Scrapy, how can I navigate to the "next page" link from any results page generated by sciencedirect.com?
The nextpage link is the input element:
<div class="paginationBar">
<span style="color:#A4A4A4;" aria-disabled="true" alt="Previous Page" title="Previous Page"><< Previous</span>
<span class="pageText">Page 1 of 20462</span>
<input class="nextPrev" type="submit" title="Next Page" alt="Next Page" name="bottomNext" onmouseout="this. className='nextPrev'" onmouseover="this.className='nextPrevHov'" value="Next >>">
</div>
There is also some JavaScript, but I don't know how to handle it :(
The answer is simple: there is no JavaScript involved.
If you look at the site, you can see that the Next >> link is an input field which submits the form.
Looking at the form itself, you can see that it sends a GET request. You can gather the form's input fields and then yield a new Request with Scrapy to scrape the next page.
An example would be:
form = response.xpath('//form[@name="Tag"]')[0]
url = 'http://www.sciencedirect.com/science/?'
for inp in form.xpath('.//input[@type="hidden"]'):
    url += inp.xpath('./@name').extract()[0]+'='+inp.xpath('./@value').extract()[0]+'&'
url += 'bottomNext=Next+%3E%3E&resultsPerPage=25'
yield Request(url)
Naturally some error handling is needed (for example after 1000 results you cannot view more so you will get an error site which does not have the form).
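Concatenating names and values by hand breaks as soon as a hidden value contains a character such as & or a space. Here is a small sketch of the same idea using urllib.parse.urlencode, assuming the hidden inputs have already been read into (name, value) pairs. The function name build_next_page_url and the sample field names and values are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.sciencedirect.com/science/?"


def build_next_page_url(hidden_fields):
    """Build the GET URL that pressing the 'Next >>' submit button would request.

    hidden_fields is a list of (name, value) pairs taken from the form's hidden inputs.
    """
    params = list(hidden_fields)
    params.append(("bottomNext", "Next >>"))
    params.append(("resultsPerPage", "25"))
    return BASE_URL + urlencode(params)


# Made-up field names and values, only to show the encoding.
example_url = build_next_page_url([("_ob", "ArticleListURL"), ("md5", "a b&c")])
# example_url ==
# "http://www.sciencedirect.com/science/?_ob=ArticleListURL&md5=a+b%26c&bottomNext=Next+%3E%3E&resultsPerPage=25"
```

In the spider, the pairs would come from the same loop over the hidden inputs, and the resulting URL would be passed to Request as before.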
I want to click a button with Python. The form's fields are filled in automatically by the web page. The HTML code for the button that sends the request is:
<INPUT type="submit" value="Place a Bid">
How would I go about doing this?
Is it possible to click the button with just urllib or urllib2? Or will I need to use something like mechanize or twill?
Use the form's action URL and send any inputs as request data, like this:
<form action="http://mysite.com/blah.php" method="GET">
......
......
......
<input type="text" name="in1" value="abc">
<INPUT type="submit" value="Place a Bid">
</form>
Python:
import urllib2
# parse the page HTML with the form to get the form action and any input names and values... (except for a submit and reset button)
# You can use xml.dom.minidom or HTMLParser
# form_action gets parsed into "http://mysite.com/blah.php"
# input1_name gets parsed into "in1"
# input1_value gets parsed into "abc"
form_url = form_action + "?" + input1_name + "=" + input1_value
# form_url value is "http://mysite.com/blah.php?in1=abc"
# Then open the new URL which is the same as clicking the submit button
s = urllib2.urlopen(form_url)
You can parse the HTML with HTMLParser
And don't forget to urlencode any form data with:
urllib.urlencode(query)
You may want to take a look at IronWatin - https://github.com/rtyler/IronWatin to fill the form and "click" the button using code.
Using urllib.urlopen, you could send the values of the form as the data parameter to the page specified in the form tag. But this won't automate your browser for you, so you'd have to get the form values some other way first.
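Putting these answers together, here is a rough Python 3 sketch that uses only the standard library: it reads the form's submission URL (the action attribute) and its named inputs, then builds the URL a GET submit would request. FormParser and build_get_url are hypothetical names, and the sketch assumes a single form per page with an action attribute present.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormParser(HTMLParser):
    """Records the form's action URL and its input fields (assumes one form per page)."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            # Skip buttons; inputs without a name are never submitted anyway.
            if (attrs.get("type") or "text").lower() not in ("submit", "reset", "button"):
                self.fields.append((attrs["name"], attrs.get("value") or ""))


def build_get_url(page_html):
    """Return the URL a browser would request when the GET form is submitted."""
    parser = FormParser()
    parser.feed(page_html)
    return parser.action + "?" + urlencode(parser.fields)


sample_page = """
<form action="http://mysite.com/blah.php" method="GET">
<input type="text" name="in1" value="abc">
<INPUT type="submit" value="Place a Bid">
</form>
"""
form_url = build_get_url(sample_page)
# form_url == "http://mysite.com/blah.php?in1=abc"
```

Opening the result is then a matter of urllib.request.urlopen(form_url). For a form with method="POST", the encoded fields would go in the data argument (as bytes) instead of the URL.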