Iterating through select items on AJAX page with Scrapy and Splash - python

I am scraping a page, using both Scrapy and Splash. The page contains a dropdown box (technically, a select HTML element). Each time an element is selected in the dropdown box, a new page is loaded using AJAX.
The HTML segment below is a simplified version of the page I'm processing:
<html>
  <head><title>Title goes here ...</title></head>
  <body>
    <select class="foo">
      <option value=100 data-reactid=1>One</option>
      <option value=200 data-reactid=2>Two</option>
      <!-- ... -->
      <option value=900 data-reactid=9>Nine</option>
    </select>
  </body>
</html>
Snippet of my scrapy/splash code:
# Fetch the options ... now what ?
options = response.css("select[class=foo] option[data-reactid]")
How do I programmatically use Splash to 'click' and receive the reloaded AJAX page in my response object?

You might try to use Splash's execute endpoint with a Lua script that fills the select with each option's value and returns the result. Something like:
...
script = """
function main(splash)
splash.resource_timeout = 10
splash:go(splash.args.url)
splash:wait(1)
splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
splash:wait(1)
return {
html = splash:html(),
}
end
"""
# base_url refers to page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
    yield scrapy_splash.SplashRequest(
        base_url, self.parse_result, endpoint='execute',
        args={'lua_source': script, 'value': value, 'timeout': 3600})
Of course, this isn't tested, but you might start there and play with it.
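To consume each rendered page, a minimal parse_result sketch might look like the following (untested; with endpoint='execute' and a Lua table as the return value, scrapy-splash should expose that table on response.data, and div.result is just a placeholder selector):
import scrapy

def parse_result(self, response):
    # a method on your spider; 'html' matches the key returned by the Lua script
    sel = scrapy.Selector(text=response.data['html'])
    # placeholder selector - extract whatever the AJAX reload added
    for text in sel.css('div.result::text').extract():
        yield {'text': text}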

Related

Flask - href to anchor on a different page (navigation bar)

I am using Flask to develop a web app. On the home page (index.html), the navigation bar navigates one to specific sections on the page using anchors:
<a class='text' href='#body2'>calculate</a>
<a id="body2"></a>
On the home page, there is a form which links you to a new page (output.html). I want the same navigation bar to navigate a user to the previous page (index.html) and the specific sections. I have written the navigation links on the second page as shown below:
<a class='text' href="{{ url_for('index') }}#body2">calculate</a>
When I click the navigation links, the new page does not load. However, and this is the strange thing: when I inspect the navigation link element in my browser and click the link through the inspector, it does take me to the correct page/section.
If I remove '#body2' from the above line, it successfully navigates me to the previous page, but not to the specific section.
(If you want to physically try out the navigation links on the web app, use the following link:
http://yourgreenhome.appspot.com/ - Enter some random values into the blank form entries and it will take you to the second page. It is running through Google's App Engine, but this is definitely not causing the problem, because the problem still occurs when I run the site on localhost).
You have an error in smoothscroll.js
$(document).ready(function(){
    $("a").on('click', function(event) {
        if (this.hash !== "") {
            event.preventDefault();
            var hash = this.hash;
            $('html, body').animate({
                scrollTop: $(hash).offset().top
            }, 800, function(){
                window.location.hash = hash;
            });
        }
    });
});
On the advpy page, $(hash).offset() is undefined, so reading .top from it fails. Because you are preventing the default event (event.preventDefault();), the click on the link doesn't navigate anywhere. Guarding the scroll code, e.g. with if ($(hash).length) { ... }, so the default navigation runs when the anchor isn't on the current page, should fix it.

Get an HTML site "input" element by Python Selenium

I'm really stuck here.
I use Python + Selenium to automate filling in a website form.
I enter some data into the webpage, then "click" a button; after that a new value appears in an element, and I would like to get that value, but I'm stuck.
How should I get that value into a variable?
I tried to use find_element_by_xpath, which works for "click" and for "send_keys", but for getting any value back, nothing.
Please help me!
A picture of the webpage inspection is enclosed.
from selenium import webdriver

browser = webdriver.Chrome('chromedriver.exe')
browser.get('https://...')
browser.find_element_by_xpath('//input[@parameter-name="moduleId"]').send_keys('1234')
Up to that point everything is OK: I can fill the "moduleId" element with the value 1234.
But from here I cannot read it back.
So if I try it like this:
moduleId = browser.find_element_by_xpath('//input[@parameter-name="moduleId"]')
the output is nothing.
Here is the interesting HTML part of the website:
<input type="text" name="parameterValue" class="form-control"
placeholder="Value" spellcheck="false" autocomplete="off" data-bind="value:
value, valueUpdate: 'keyup', autocomplete: { options: options, filtered:
true }, attr: { 'parameter-name': name, type: inputType }" parameter-name="moduleId">
Use this to get the value of the input element:
input.get_attribute('value')
You could also do it all in one shot, for example:
browser.find_element_by_xpath('//input[@parameter-name="moduleId"]').get_attribute('value')
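If the new value only appears a moment after the button click (AJAX), it may also be worth waiting for it explicitly before reading it back — an untested sketch reusing the question's XPath:
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the input to carry a non-empty value,
# then hand that value back
moduleId_value = WebDriverWait(browser, 10).until(
    lambda b: b.find_element_by_xpath(
        '//input[@parameter-name="moduleId"]').get_attribute('value'))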
Thanks @scilence!
I made this modification:
moduleId = browser.find_element_by_xpath('//input[@parameter-name="moduleId"]')
moduleId_value = moduleId.get_attribute('value')
print("data: ", moduleId_value)
and the output is what I need!

How to call a postback in ASP.Net with Python

I am trying to web-scrape some elements and their values off a page with Python; however, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to trigger it. Unfortunately, Python is only printing the same values over and over again [meaning the postback for the next button isn't being fired]. I am using requests to do my POST/GET.
import re
import time
import requests
TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')
page = SESSION.get(GROUP_URL, headers = REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids, names) in re.findall(TITLE_REGEX, page):
            print ids, names
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page = SESSION.post(GROUP_URL, data=postData, stream=True).text
    time.sleep(2)
How can I properly trigger the postback in ASP.NET from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML element of the button:
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the __doPostBack method; however, I would like to first see if this can be achieved in pure Python.
Yes, it should be achievable; you just have to submit the correct values in the correct fields. But I assume the web page you are trying to parse uses ASP.NET Web Forms, so it would be really time-consuming to find all the values. I suggest you look into Selenium; with it you can easily trigger clicks and events on a webpage without writing so much code.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element_by_id("button").click()
# then get the data you want
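If you would rather stay with pure requests, note that __doPostBack(target, argument) simply fills the hidden __EVENTTARGET and __EVENTARGUMENT inputs with its arguments and submits the form, so those two fields appear to be what the question's postData is missing — an untested sketch on top of the question's code:
# __EVENTTARGET must carry the name passed to __doPostBack in the button's href
postData["__EVENTTARGET"] = "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00"
postData["__EVENTARGUMENT"] = ""
page = SESSION.post(GROUP_URL, data=postData, headers=REQUEST_HEADERS).text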

Navigating to ScienceDirect's NextPage using scrapy

Using Scrapy, how do I navigate to the "next page" link from any results page generated by sciencedirect.com?
The next-page link is this input element:
<div class="paginationBar">
<span style="color:#A4A4A4;" aria-disabled="true" alt="Previous Page" title="Previous Page"><< Previous</span>
<span class="pageText">Page 1 of 20462</span>
<input class="nextPrev" type="submit" title="Next Page" alt="Next Page" name="bottomNext" onmouseout="this. className='nextPrev'" onmouseover="this.className='nextPrevHov'" value="Next >>">
</div>
And there is some JavaScript, but I don't know how to deal with it :(
The answer is simple: there is no JavaScript involved.
If you look at the site you can see that the Next >> link is an input field which submits the form.
When looking at the form itself, you can see that it sends a GET request. You can gather the input fields for this request together and then yield a new Request with Scrapy to scrape the next page.
An example would be:
from scrapy import Request

form = response.xpath('//form[@name="Tag"]')[0]
url = 'http://www.sciencedirect.com/science/?'
for inp in form.xpath('.//input[@type="hidden"]'):
    url += inp.xpath('./@name').extract()[0] + '=' + inp.xpath('./@value').extract()[0] + '&'
url += 'bottomNext=Next+%3E%3E&resultsPerPage=25'
yield Request(url)
Naturally, some error handling is needed (for example, after 1000 results you cannot view any more, so you will get an error page which does not have the form).
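Alternatively, Scrapy's FormRequest.from_response can gather the form's fields for you and simulate the click on a submit button — an untested sketch, assuming the form really is the one named "Tag" from the XPath above:
from scrapy import FormRequest

def parse(self, response):
    # from_response copies the form's inputs and respects its GET method;
    # clickdata picks the "Next >>" submit input by name
    yield FormRequest.from_response(
        response,
        formname='Tag',
        clickdata={'name': 'bottomNext'},
        callback=self.parse)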

Python: Clicking a button with urllib or urllib2

I want to click a button with Python; the info for the form is automatically filled in by the webpage. The HTML code for the button is:
<INPUT type="submit" value="Place a Bid">
How would I go about doing this?
Is it possible to click the button with just urllib or urllib2? Or will I need to use something like mechanize or twill?
Use the form's action URL and send any input as request data, like this:
<form action="http://mysite.com/blah.php" method="GET">
    ...
    <input type="text" name="in1" value="abc">
    <INPUT type="submit" value="Place a Bid">
</form>
Python:
# Parse the page HTML with the form to get the form action and any input
# names and values (except for submit and reset buttons).
# You can use xml.dom.minidom or HTMLParser.
# form_action gets parsed into "http://mysite.com/blah.php"
# input1_name gets parsed into "in1"
# input1_value gets parsed into "abc"
form_url = form_action + "?" + input1_name + "=" + input1_value
# form_url value is "http://mysite.com/blah.php?in1=abc"
# Then open the new URL, which is the same as clicking the submit button
s = urllib2.urlopen(form_url)
You can parse the HTML with HTMLParser.
And don't forget to urlencode the form data with:
urllib.urlencode(query)
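Putting those pieces together, a minimal Python 2 sketch (using the hypothetical parsed values from the comments above):
import urllib
import urllib2

form_action = "http://mysite.com/blah.php"   # parsed from the form tag
form_data = {"in1": "abc"}                   # parsed input names/values

# for a GET form, the urlencoded data goes into the query string
form_url = form_action + "?" + urllib.urlencode(form_data)
html = urllib2.urlopen(form_url).read()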
You may want to take a look at IronWatin - https://github.com/rtyler/IronWatin to fill the form and "click" the button using code.
Using urllib.urlopen, you could send the values of the form as the data parameter to the page specified in the form tag. But this won't automate your browser for you, so you'd have to get the form values some other way first.
