I have been trying to access a link on a website using the Python requests library. Normally, clicking the button redirects to another website, but pasting the link directly into the browser or calling requests.get() on it only returns the referrer page.
The link to the referrer page is: "https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7"
Here's the HTML for the button:
<a
href="https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download"
class="btn"
type="submit"
title="Download Video"
>
<i class="fas fa-download"></i> Download <i class="fas fa-file-video"></i>
<span class="small-text">(Video)</span>
</a>
If I copy and paste the link ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download") directly into the browser, it redirects to this link ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7") instead of this one ("https://www.sabishare.com/file/mHxiMiZHW15-alchemy-of-souls-s01e07-netnaija-com-mp4").
So the only way to get to this URL ("https://www.sabishare.com/file/mHxiMiZHW15-alchemy-of-souls-s01e07-netnaija-com-mp4") is by clicking the button on this page ("https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7").
Also, this is my Python code:
import requests

def gen_link(url):
    headers = {
        'Authorization': 'Bearer {token}',
        'Content-Type': 'application/json',
    }
    print(url)
    # Follow any redirects and print the URL the request finally lands on
    resp = requests.get(url, headers=headers, allow_redirects=True)
    print(resp.url)
How is it that the destination URL is somehow blocked and can only be accessed if I click the button on the referrer page?
The issue with your script is that the request lacks the HTTP Referer header, which a browser normally sends when you click the button on the page.
Here is a YouTube video that explains this further: Unable to access link except by clicking button on websit
Here is the code snippet:
import requests

url = 'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7/download'
headers = {'Referer': 'https://www.thenetnaija.net/videos/kdrama/16426-alchemy-of-souls/season-1/episode-7'}
resp = requests.get(url, headers=headers)
print(resp.text)
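As a small sketch of the idea in the answer above, the Referer value can be derived from the download URL instead of being typed by hand. This assumes, as in the question, that the download URL is simply the page URL with a trailing "/download" segment; the helper name `referer_headers` is hypothetical and not part of requests.

```python
from urllib.parse import urlsplit, urlunsplit


def referer_headers(download_url):
    """Return headers whose Referer is the page hosting the download button.

    Assumption: the download URL is the page URL plus "/download",
    which matches the URLs shown in the question.
    """
    parts = urlsplit(download_url)
    path = parts.path.rstrip("/")
    if path.endswith("/download"):
        # Drop the trailing "/download" segment to recover the page path
        path = path[: -len("/download")]
    page_url = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    return {"Referer": page_url}
```

The returned dictionary can be passed as the headers argument of requests.get(). Whether the site checks anything besides the Referer (cookies, for instance) is not known from the question, so a requests.Session may still be needed.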
I have a Flask application with a few custom-built tools. I'm trying to bring some other tools into that Flask application to have a single place for everything. One of those tools is MicroStrategy. I'm rendering a template and the MicroStrategy login page is working, but when I log in, it just kicks me back to the login page. When I look at the request, there are two Set-Cookie headers with errors.
Is it possible to do what I'm trying to do? Is there a way to read the headers from the MicroStrategy page in the iframe and change them to SameSite=None?
Here is my Flask app:
@dash_app.server.route("/mstr")
def mstr():
    resp = make_response(render_template("mstr.html"))
    return resp
mstr.html:
<div style="position:fixed; width:100%; top:50px; left:0px; right:0px; bottom:0px; z-index:1;">
<iframe src="https://webserver.com/MicroStrategy/asp/Main.aspx" title="MicroStrategy" style="width:100%; height:100%; border:none; margin:0; padding:0; overflow:hidden;"></iframe>
</div>
I am trying to scrape some elements and their values off a page with Python. However, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python only prints the same values over and over again (meaning the postback for the next button isn't being called). I am using requests for my POST/GET requests.
import re
import time
import requests

TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d" % TARGET_GROUP_ID
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')

page = SESSION.get(GROUP_URL, headers=REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids, names) in re.findall(TITLE_REGEX, page):
            print(ids, names)
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page = SESSION.post(GROUP_URL, data=postData, stream=True).text
        time.sleep(2)
How can I properly call the post back in ASP.NET from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML Element of the button
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
<div class="paging_pagenums_container">125</div>
<input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
</div>
</div>
I was thinking of using a JS library and executing the JS __postback method, however, I would like to first see if this can be achieved in pure Python.
Yes, it should be achievable; you just have to submit the correct values in the correct fields. However, I assume the page you are trying to parse uses ASP.NET Web Forms, so finding those values is likely to be time consuming. I suggest looking into Selenium; with it you can easily trigger clicks and events on a web page without writing so much code.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element(By.ID, "button").click()
# then get the data you want
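For the pure-Python route mentioned above, a rough sketch of "submitting the correct values in the correct fields" could look like the following. It relies on the standard Web Forms behaviour where `__doPostBack(target, argument)` fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields before submitting the form. The names `HiddenInputParser` and `build_postback` are made up for this illustration, and whether this particular site accepts such a payload (especially with asynchronous postbacks) is an assumption that has not been verified.

```python
import re
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Matches the two quoted arguments of a __doPostBack('target','argument') call
DO_POSTBACK = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def build_postback(page_html, link_href):
    """Build form data that imitates clicking a __doPostBack link."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    match = DO_POSTBACK.search(link_href)
    if match is None:
        raise ValueError("href does not contain a __doPostBack call")
    payload = dict(parser.fields)
    # The browser-side __doPostBack sets these two fields before submitting
    payload["__EVENTTARGET"], payload["__EVENTARGUMENT"] = match.groups()
    return payload
```

The resulting dictionary would be sent as the data argument of a POST on the same session that fetched the page, and rebuilt from each new response, since the hidden state fields change between pages.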
Using Scrapy, how do I navigate to the "next page" link from any results page generated by sciencedirect.com?
The nextpage link is the input element:
<div class="paginationBar">
<span style="color:#A4A4A4;" aria-disabled="true" alt="Previous Page" title="Previous Page"><< Previous</span>
<span class="pageText">Page 1 of 20462</span>
<input class="nextPrev" type="submit" title="Next Page" alt="Next Page" name="bottomNext" onmouseout="this. className='nextPrev'" onmouseover="this.className='nextPrevHov'" value="Next >>">
</div>
There is also some JavaScript, but I don't know how to handle it :(
The answer is simple: there is no JavaScript involved.
If you look at the site, you can see that the Next >> link is an input field which submits the form.
Looking at the form itself, you can see that it sends a GET request to the site. You can gather the form's input fields and then yield a new Request with Scrapy to scrape the next page.
An example would be:
form = response.xpath('//form[@name="Tag"]')[0]
url = 'http://www.sciencedirect.com/science/?'
# Append every hidden input of the form as a query parameter
for inp in form.xpath('.//input[@type="hidden"]'):
    url += inp.xpath('./@name').extract()[0] + '=' + inp.xpath('./@value').extract()[0] + '&'
url += 'bottomNext=Next+%3E%3E&resultsPerPage=25'
yield Request(url)
Naturally, some error handling is needed (for example, after 1000 results you cannot view more, so you will get an error page that does not have the form).
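The string concatenation above does not escape the hidden field values. A possible variation, sketched here with the standard library's `urlencode`, takes the (name, value) pairs already extracted from the form and encodes them in one step. The function name `build_next_page_url` and the `_ob` field in the usage line are hypothetical; the base URL and the two trailing parameters are the ones from the answer.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.sciencedirect.com/science/?"


def build_next_page_url(hidden_fields, results_per_page=25):
    """Build the "next page" URL from the form's hidden (name, value) pairs."""
    params = list(hidden_fields)
    # Same two parameters the answer appends by hand
    params.append(("bottomNext", "Next >>"))
    params.append(("resultsPerPage", str(results_per_page)))
    # urlencode percent-escapes values and encodes spaces as "+"
    return BASE_URL + urlencode(params)


# Hypothetical field name, for illustration only
print(build_next_page_url([("_ob", "ArticleListURL")]))
```

Inside a spider, the pairs would come from the same XPath queries as in the answer, and the returned URL would be passed to Request.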
I have the following script:
import requests
import http.cookiejar

jar = http.cookiejar.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'
           }
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print(page.text)
But the printed page.text shows that the site is trying to send me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I am not sure how to log in to a JavaScript page. Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
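A minimal sketch of that suggestion: resolve the form's action against the login page URL and post the credentials there through a single session, so that any cookie set on login is reused afterwards. The helper names `form_target` and `log_in` are hypothetical, and the field names USERNAME and PASSWORD are taken from the question without checking them against the actual form.

```python
from urllib.parse import urljoin

LOGIN_PAGE = "http://www.whispernumber.com/signIn.jsp?source=calendar.jsp"


def form_target(page_url, action):
    """Resolve a form's (possibly relative) action against the page URL."""
    return urljoin(page_url, action)


def log_in(session, username, password):
    """Post credentials to the form's real target.

    `session` is expected to be a requests.Session; any object with
    compatible get() and post() methods works.
    """
    # Load the login page first so the session picks up initial cookies
    session.get(LOGIN_PAGE)
    return session.post(
        form_target(LOGIN_PAGE, "ValidatePassword.jsp"),
        # Field names assumed from the question's code
        data={"USERNAME": username, "PASSWORD": password},
    )
```

With requests, this would be called as log_in(requests.Session(), ...), and the calendar page would then be fetched with the same session object instead of a separately managed cookie jar.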
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page. You will want to use something like Selenium or a headless browser to programmatically log in to the site.
I am trying to submit data to a form on a web page, but the form is actually on a page redirected from the main login page. How do I get to the redirected page using Python? I am using requests; I could not use a browser because I was getting an SSL certificate error during login authentication on this page.
Page code:
<!--- Content Frame --->
<iframe id="contentFrame" name="contentFrame" src="/action/timesheet.action? action=TimesheetRedirectAction" scrolling="auto" frameBorder="0"></iframe>
<!--- WFM jQuery code --->
My code:
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

website = 'https://qtime.qualcomm.com/login.jsp'
r = requests.get(website, auth=('username', 'password'), verify=False)
if r.status_code == 200:
    print("Login successful")
    print(r.content)
    print(r.url)