Using Scrapy, how do I navigate to the "next page" link from any results page generated by sciencedirect.com?
The next-page link is this input element:
<div class="paginationBar">
<span style="color:#A4A4A4;" aria-disabled="true" alt="Previous Page" title="Previous Page"><< Previous</span>
<span class="pageText">Page 1 of 20462</span>
<input class="nextPrev" type="submit" title="Next Page" alt="Next Page" name="bottomNext" onmouseout="this.className='nextPrev'" onmouseover="this.className='nextPrevHov'" value="Next >>">
</div>
There is also some JavaScript, but I don't know how to handle it :(
The answer is simple: there is no JavaScript involved.
If you look at the site, you can see that the "Next >>" link is an input field which submits the form.
Looking at the form itself, you can see that it sends a GET request. You can gather the input fields for this request and then yield a new Request with Scrapy to scrape the next page.
An example would be:
form = response.xpath('//form[@name="Tag"]')[0]
url = 'http://www.sciencedirect.com/science/?'
for inp in form.xpath('.//input[@type="hidden"]'):
    url += inp.xpath('./@name').extract()[0] + '=' + inp.xpath('./@value').extract()[0] + '&'
url += 'bottomNext=Next+%3E%3E&resultsPerPage=25'
yield Request(url)
Naturally, some error handling is needed (for example, after 1000 results you cannot view more, so you will get an error page which does not have the form).
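A slightly safer variant of the approach above is to let the standard library escape the query string instead of concatenating it by hand. A sketch (the hidden-field names below are placeholders for illustration, not the real sciencedirect.com ones):

```python
# Build the next-page URL with urllib.parse.urlencode so names and values
# are percent-escaped correctly ("Next >>" becomes "Next+%3E%3E").
from urllib.parse import urlencode

def next_page_url(hidden_fields):
    """hidden_fields: list of (name, value) pairs scraped from the form."""
    params = list(hidden_fields) + [
        ('bottomNext', 'Next >>'),
        ('resultsPerPage', '25'),
    ]
    return 'http://www.sciencedirect.com/science/?' + urlencode(params)

# placeholder field names, for illustration only:
url = next_page_url([('fieldA', 'abc'), ('fieldB', '123')])
```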
There is a link on a website I have been trying to access using the Python requests library. Normally, clicking the button redirects to another website, but copying and pasting the referrer link either directly into the browser or into requests.get() only returns the referrer page.
The link to the referrer page is: "https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7"
Here's the html with the button
<a
href="https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download"
class="btn"
type="submit"
title="Download Video"
>
<i class="fas fa-download"></i> Download <i class="fas fa-file-video"></i>
<span class="small-text">(Video)</span>
</a>
If I try to copy and paste the link ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download") directly into the browser, it redirects to this link ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7") instead of this one ("https://www.sabishare.com/file/mHxiMiZHW15-alchemy-of-souls-s01e07-netnaija-com-mp4").
So the only way to get to that URL ("https://www.sabishare.com/file/mHxiMiZHW15-alchemy-of-souls-s01e07-netnaija-com-mp4") is by clicking the button on this page ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7").
Also, this is my Python code:
def gen_link(url):
    headers = {
        'Authorization': 'Bearer {token}',
        'Content-Type': 'application/json',
    }
    print(url)
    resp = requests.get(url, headers=headers, allow_redirects=True)
    print(resp.url)
How is it that the destination URL is somewhat blocked and can only be accessed by clicking the button on the referrer webpage?
The issue with your script is the lack of the HTTP Referer request header.
There is a YouTube video explaining this further: "Unable to access link except by clicking button on website".
Here is the code snippet:
import requests

url = 'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download'
headers = {'Referer': 'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7'}
resp = requests.get(url, headers=headers)
print(resp.text)
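To see whether the Referer header actually unlocked the redirect, you can inspect the redirect chain requests followed. A small helper sketch (not tested against the site itself):

```python
import requests

def fetch_with_referer(url, referer):
    """Fetch url with a Referer header and return the final URL after
    requests has followed any redirects (allow_redirects defaults to True)."""
    resp = requests.get(url, headers={'Referer': referer})
    for hop in resp.history:  # each 3xx response that was followed
        print(hop.status_code, hop.headers.get('Location'))
    return resp.url

# usage (live network call, so shown only as a comment):
# fetch_with_referer(
#     'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download',
#     'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7')
```

If the Referer header is accepted, the final `resp.url` should be the sabishare.com destination rather than the referrer page.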
I am scraping a page, using both Scrapy and Splash. The page contains a dropdown box (technically, a select HTML element). Each time an element is selected in the dropdown box, a new page is loaded using AJAX.
The HTML segment below, is a simplified version of the page I'm processing:
<html>
<head><title>Title goes here ...</title></head>
<body>
<select class="foo">
<option value=100 data-reactid=1>One</option>
<option value=200 data-reactid=2>Two</option>
<!-- ... -->
<option value=900 data-reactid=9>Nine</option>
</select>
</body>
</html>
Snippet of my scrapy/splash code:
# Fetch the options ... now what ?
options = response.css("select[class=foo] option[data-reactid]")
How do I programmatically use Splash to 'click' the option and receive the reloaded AJAX page in my response object?
You might try Splash's execute endpoint with a Lua script that fills the select with each option's value and returns the result. Something like:
...
script = """
function main(splash)
    splash.resource_timeout = 10
    splash:go(splash.args.url)
    splash:wait(1)
    splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
    splash:wait(1)
    return {
        html = splash:html(),
    }
end
"""
# base_url refers to the page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
    yield scrapy_splash.SplashRequest(
        base_url, self.parse_result, endpoint='execute',
        args={'lua_source': script, 'value': value, 'timeout': 3600})
Of course, this isn't tested, but you might start there and play with it.
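One caveat (an assumption on my part, not something tested): if the select is driven by React, setting `.value` alone may not trigger the AJAX reload, because React reacts to events rather than polling values. Dispatching a `change` event after setting the value is a common workaround; the JS handed to `splash:runjs` could be extended like this, where `VALUE` stands for the string spliced in on the Lua side:

```python
# JS for splash:runjs that also fires a 'change' event so event-driven
# listeners (e.g. React) notice the new value. "VALUE" is a placeholder
# for the option value concatenated in via splash.args.value, as above.
set_and_notify = (
    'var sel = document.getElementsByClassName("foo")[0];'
    'sel.value = "VALUE";'
    'sel.dispatchEvent(new Event("change", {bubbles: true}));'
)
```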
I am trying to web-scrape some elements and their values off a page with Python; however, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python only prints the same values over and over again (meaning the postback for the next button isn't being called). I am using requests to do my POST/GET.
import re
import time
import requests
TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')
page = SESSION.get(GROUP_URL, headers=REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids, names) in re.findall(TITLE_REGEX, page):
            print ids, names
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page = SESSION.post(GROUP_URL, data=postData, stream=True).text
        time.sleep(2)
How can I properly trigger the ASP.NET postback from Python to fix this issue? As stated before, it only prints the same values each time.
This is the HTML Element of the button
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the __doPostBack method; however, I would first like to see whether this can be achieved in pure Python.
Yes, it should be achievable; you just have to submit the correct values in the correct fields. But I assume the web page you are trying to parse uses ASP.NET Web Forms, so it would be really time-consuming to find those values. I suggest you look into Selenium; with that, you can easily trigger clicks and events on a webpage without writing so much code.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element_by_id("button").click()
# then get the data you want
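If you do want to stay in pure Python, note that `__doPostBack('target', '')` does little more than copy its arguments into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submit the form, so the POST body has to carry those two fields explicitly. A sketch (the target name is taken from the button's href in the question; the state values are placeholders):

```python
def build_postback_data(page_state, event_target, event_argument=''):
    """Merge the scraped ASP.NET state fields (__VIEWSTATE,
    __EVENTVALIDATION, ...) with the postback target that
    __doPostBack would normally fill in on the client."""
    data = dict(page_state)
    data['__EVENTTARGET'] = event_target
    data['__EVENTARGUMENT'] = event_argument
    return data

# placeholder state values; the target comes from the button's href
postData = build_postback_data(
    {'__VIEWSTATE': '...', '__EVENTVALIDATION': '...'},
    'ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00')
```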
I'm using Python 3.3 and the Requests library to do a basic POST request.
I want to simulate what happens if you manually enter information into the browser from the webpage:
https://www.dspayments.com/FAIRFAX. For example, at that url, enter "x" for the license plate and Virginia as the state. Then the url changes to: https://www.dspayments.com/FAIRFAX/Home/PayOption, and it displays the desired information (I care about the source code of this second webpage).
I looked through the source code of the above two URLs. Doing "inspect element" on the text boxes of the first URL, I found some things that need to be included in the POST request: {'Plate':"x", 'PlateStateProv':"VA", "submit":"Search"}.
Then the second website (ending in /PayOption), had the raw html:
<form action="/FAIRFAX/Home/PayOption" method="post"><input name="__RequestVerificationToken" type="hidden" value="6OBKbiFcSa6tCqU8k75uf00m_byjxANUbacPXgK2evexESNDz_1cwkUpVVePA2czBLYgKvdEK-Oqk4WuyREi9advmDAEkcC2JvfG2VaVBWkvF3O48k74RXqx7IzwWqSB5PzIJ83P7C5EpTE1CwuWM9MGR2mTVMWyFfpzLnDfFpM1" /><div class="validation-summary-valid" data-valmsg-summary="true">
I then used the name:value pairs from the above html as keys and values in my payload dictionary of the post request. I think the problem is that in the second url, there is the "__RequestVerificationToken" which seems to have a randomly generated value every time.
How can I properly POST to this website? A "correct" answer would be one that produces the same source code on the website ending in "/PayOption" as if you manually enter "x" as the plate number and Virginia as the state and click submit on the first url.
My code is:
import requests

url1 = r'https://www.dspayments.com/FAIRFAX'
url2 = r'https://www.dspayments.com/FAIRFAX/Home/PayOption'
user_agent = {'User-Agent': 'Mozilla/5.0'}  # browser-like User-Agent header

s = requests.Session()

# GET request
r = s.get(url1)
text1 = r.text
startstr = '<input name="__RequestVerificationToken" type="hidden" value="'
start_ind = text1.find(startstr) + len(startstr)
end_ind = text1.find('"', start_ind)
auth_string = text1[start_ind:end_ind]

# POST request
payload = {'Plate': 'x', 'PlateStateProv': 'VA', 'submit': 'Search',
           '__RequestVerificationToken': auth_string,
           'validation-summary-valid': 'true'}
post = s.post(url2, headers=user_agent, data=payload)
source_code = post.text
Thanks, -K.
You should only need the data from the first page, and as you say, the __RequestVerificationToken changes with each request.
You'll have to do something like:
1. GET request to https://www.dspayments.com/FAIRFAX
2. Harvest the __RequestVerificationToken value (the Requests Session will take care of any associated cookies)
3. POST using the data you scraped from the GET request
4. Extract whatever you need from the 2nd page
So, just focus on creating a form that's exactly like the one in the first page. Have a stab at it and if you're still struggling I can help dig into the particulars.
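The harvesting step can also be isolated into a small helper; here's a sketch using a regex instead of string slicing (it assumes the token sits in a hidden input laid out as in the HTML shown in the question):

```python
import re

def extract_token(html):
    """Pull the __RequestVerificationToken value out of the hidden input."""
    m = re.search(
        r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', html)
    return m.group(1) if m else None

sample = '<input name="__RequestVerificationToken" type="hidden" value="abc123" />'
token = extract_token(sample)  # "abc123"
```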
I want to go to a site and click on a button or link to log in. But the login does not use a form.
I think the login procedure uses JavaScript.
Input for username:
<input tabindex="1"
class="dxeEditArea_Office2003Blue dxeEditAreaSys" onkeydown="aspxEKeyDown('ctl00_wucLogin1_txtUID', event)"
name="ctl00$wucLogin1$txtUID"
onkeyup="aspxEKeyUp('ctl00_wucLogin1_txtUID', event)"
type="text"
id="ctl00_wucLogin1_txtUID_I"
onblur="aspxELostFocus('ctl00_wucLogin1_txtUID')"
onfocus="aspxEGotFocus('ctl00_wucLogin1_txtUID')"
onkeypress="aspxEKeyPress('ctl00_wucLogin1_txtUID', event)"
style="height:15px;">
The link for login is :
<a id="ctl00_wucLogin1_BtnLogin"
class="Search_button"
href="javascript:__doPostBack('ctl00$wucLogin1$BtnLogin','')"
style="...">Login
</a>
How can I click on this link, and how can I fill in that username input, using twill?
Is there any other alternative to twill?
Thanks,
Try sending the User-Agent header with the request. In any case, twill doesn't handle JavaScript well, so you're better off trying something else.
For alternatives, there are:
Mechanize
Requests
BeautifulSoup (for parsing HTML)
Selenium (python bindings)