I've looked at quite a few suggestions for clicking on buttons on web pages using python, but don't fully understand what the examples are doing and can't get them to work (particularly the ordering and combination of values).
I'm trying to download a PDF from a web site.
The first time you click on the PDF to download it, it takes you to a page where you have to click on "Agree and Proceed". Once you do that the browser stores the cookie (so you never need to agree again) and then opens the PDF in the browser (which is what I want to download).
Here is the link to the accept page - https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753
I've used Chrome Developer to get this:-
<form name="showAnnouncementPDFForm" method="post" action="announcementTerms.do">
<input value="Decline" onclick="window.close();return false;" type="submit">
<input value="Agree and proceed" type="submit">
<input name="pdfURL" value="/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf" type="hidden">
</form>
and this is the final page you get to:- "https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf"
then tried to use it like this:-
import requests
values = {}
values['showAnnouncementRDFForm'] = 'announcementTerms.do'
values['pdfURL'] = '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'
req = requests.post('https://asx.com.au/', data=values)
print(req.text)
I tried a variety of URLs and changed the values I provide, but I don't think it's working correctly.
The print at the end gives me what looks like the HTML of a web page. I'm not sure exactly what it is, as I'm doing this from the command line of a server (a Pi) I'm ssh'd into, but I'm confident it's not the PDF I'm after.
As a final solution, I'd like the Python code to take the PDF link, automatically Agree and Proceed, store the cookie to avoid future approvals, then download the PDF.
Hope that made sense and thanks for taking the time to read my question.
Markus
If you want to download the file directly and you know the URL you can access it without using a cookie:
import requests
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf")
with open('./test1.pdf', 'wb') as f:
    f.write(response.content)
If you don't know the URL you can read it from the form then access it directly without a cookie:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
response = requests.get(f'{base_url}{pdf_url}')
with open('./test2.pdf', 'wb') as f:
    f.write(response.content)
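If installing BeautifulSoup on the Pi is awkward, the hidden field can also be pulled out with the stdlib html.parser. A minimal sketch, run here against the form markup quoted in the question rather than a live response:

```python
from html.parser import HTMLParser

# The form markup quoted in the question, reproduced here for the demo.
FORM_HTML = '''
<form name="showAnnouncementPDFForm" method="post" action="announcementTerms.do">
<input value="Decline" onclick="window.close();return false;" type="submit">
<input value="Agree and proceed" type="submit">
<input name="pdfURL" value="/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf" type="hidden">
</form>
'''

class PdfUrlParser(HTMLParser):
    """Collects the value of the hidden input named pdfURL."""
    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == 'pdfURL':
            self.pdf_url = attrs.get('value')

parser = PdfUrlParser()
parser.feed(FORM_HTML)
print(parser.pdf_url)  # -> /asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf
```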
If you want to set the cookie:
import requests
cookies = {'companntc': 'tc'}
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf", cookies=cookies)
with open('./test3.pdf', 'wb') as f:
    f.write(response.content)
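To keep the acceptance cookie between runs, as the question asks, a cookie jar can be saved to disk and reloaded with the stdlib http.cookiejar (a jar like this can also be attached to a requests session). The name companntc and value tc come from the snippet above; treat them as assumptions about what the site actually sets. A minimal sketch:

```python
import http.cookiejar
import os
import tempfile
import time

# Build a jar holding the acceptance cookie (name/value are assumptions
# based on the snippet above; the real site may use something else).
jar = http.cookiejar.MozillaCookieJar()
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name='companntc', value='tc',
    port=None, port_specified=False,
    domain='www.asx.com.au', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False,
    expires=int(time.time()) + 86400, discard=False,
    comment=None, comment_url=None, rest={}))

# Save to disk, then reload into a fresh jar -- this is what makes the
# "agree once, reuse forever" behaviour survive between script runs.
path = os.path.join(tempfile.mkdtemp(), 'asx_cookies.txt')
jar.save(path)

reloaded = http.cookiejar.MozillaCookieJar()
reloaded.load(path)
print(sorted((c.name, c.value) for c in reloaded))
```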
If you really want to use POST:
import requests
payload = {'pdfURL': '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'}
response = requests.post('https://www.asx.com.au/asx/statistics/announcementTerms.do', data=payload)
with open('./test4.pdf', 'wb') as f:
    f.write(response.content)
Or read the pdfURL from the form and do a POST:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
payload = {'pdfURL': pdf_url}
response = requests.post(f"{base_url}/asx/statistics/announcementTerms.do", data=payload)
with open('./test5.pdf', 'wb') as f:
    f.write(response.content)
Related
I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the widely used requests module (more than 35k stars on GitHub) together with BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
Below is a complete example, in Python 3.5.2, for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium), sequentially downloading some links with download in their URLs.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://example.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# here are the name property of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a')
             if 'download' in tag['href'])
for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code, name),
                  file=sys.stderr)
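The shutil.copyfileobj call at the end is what keeps memory use flat: it copies in fixed-size chunks instead of reading the whole body at once. A self-contained illustration, with in-memory file objects standing in for request.raw and the output file:

```python
import io
import shutil

src = io.BytesIO(b'x' * 1_000_000)  # stands in for request.raw
dst = io.BytesIO()                  # stands in for the output file

# Copy in 64 KiB chunks; at no point is the full payload read at once.
shutil.copyfileobj(src, dst, length=64 * 1024)
print(len(dst.getvalue()))  # -> 1000000
```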
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
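For instance, filtering a hypothetical list of scraped hrefs down to media extensions with re might look like this (the link list below is made up for the demo):

```python
import re

# Hypothetical hrefs, as they might come out of soup.findAll('a')
links = ['/files/a.mp3', '/about.html', '/media/b.mp4', '/img/c.jpg', '/index.php']

# Keep only links ending in a media extension
media_re = re.compile(r'\.(mp3|mp4|jpg)$')
media_links = [link for link in links if media_re.search(link)]
print(media_links)  # -> ['/files/a.mp3', '/media/b.mp4', '/img/c.jpg']
```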
Finally, use requests module to stream the media files so that they don't take up too much memory like so:
response = requests.get(url, stream=True) #URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk:  # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()
When the stream attribute of get() is set to True, the content does not immediately download into RAM; instead the response behaves like an iterable, which you can consume in chunks of size chunk_size in the loop right after the get() call. Each chunk is written to disk before the next one is fetched, so the whole file is never held in RAM at once.
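The chunking behaviour is easy to see offline; here a small generator mimics iter_content over an in-memory buffer and writes to a temporary file (everything below is a stand-in, not the requests API itself):

```python
import io
import os
import tempfile

def iter_chunks(fobj, chunk_size=512):
    # Mimics response.iter_content: yield fixed-size chunks until EOF.
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

source = io.BytesIO(b'abc' * 1000)  # stands in for the streamed response body
target_path = os.path.join(tempfile.mkdtemp(), 'media.bin')
with open(target_path, 'wb') as handle:
    for chunk in iter_chunks(source):
        handle.write(chunk)
print(os.path.getsize(target_path))  # -> 3000
```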
You will have to put this last chunk of code in a loop if you want to download media of every link in the links list.
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.
I want to submit a multipart/form-data that sets the input for a simulation on TRILEGAL, and download the file available from a redirected page.
I studied the documentation of requests, urllib, Grab, mechanize, etc., and it seems that with mechanize my code would be:
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
However, I could not test it because it is not available in Python 3.
So I tried requests:
import requests
url = 'http://stev.oapd.inaf.it/cgi-bin/trilegal'
values = {'gal_coord': "2",
          'eq_alpha': "277.981111",
          'eq_delta': "-19.0833",
          'field': " 0.047117",
          }
r = requests.post(url, files = values)
but I can't figure out how to get to the results page - if I do
r.content
it displays the content of the form that I had just submitted, whereas if you open the actual website, and click 'submit', you see a new window (following the method="post" action="./trilegal_1.6" ).
How can I get to that new window with requests (i.e. follow to the page that opens up when I click the submit button) , and click the link on the results page to retrieve the results file ( "The results will be available after about 2 minutes at THIS LINK.") ?
If you can point me to any other tool that could do the job I would be really grateful - I spent hours looking through SO for something that could help solve this problem.
Thank you!
Chris
Here is a working solution for Python 2.7:
from mechanize import Browser
from urllib import urlretrieve # for download purpose
from bs4 import BeautifulSoup
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
base_url = 'http://stev.oapd.inaf.it'
# fetch the relative url from the page source and append it to the base url
link = soup.findAll('a')[0]['href'].split('..')[1]
url = base_url + str(link)
filename = 'test.dat'
# now download the file
urlretrieve(url, filename)
Your file will be downloaded as test.dat. You can open it with the appropriate program.
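The split('..') trick works here, but urllib.parse.urljoin resolves relative hrefs against the page URL more robustly. A sketch, where the href below is hypothetical but shaped like the one the results page produces:

```python
from urllib.parse import urljoin

page_url = 'http://stev.oapd.inaf.it/cgi-bin/trilegal'
href = '../tmp/output123456.dat'  # hypothetical relative link from the results page

# urljoin resolves the '..' against the page's path for us
print(urljoin(page_url, href))  # -> http://stev.oapd.inaf.it/tmp/output123456.dat
```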
I'm posting a separate answer because a comment would be too cluttered. Thanks to @ksai, this works in Python 2.7:
import re
import time
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
#set appropriate form contents
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = "277.981111"
browser['eq_delta'] = "-19.0833"
browser['field'] = " 0.047117"
browser['photsys_file'] = ["tab_mag_odfnew/tab_mag_lsst.dat"]
browser["icm_lim"] = "3"
browser["mag_lim"] = "24.5"
response = browser.submit()
# wait 1 min while results are prepared
time.sleep(60)
# select the appropriate url
url = 'http://stev.oapd.inaf.it/' + str(browser.links()[0].url[3:])
# download the results file
browser.retrieve(url, 'test1.dat')
Thank you very much!
Chris
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on said website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to correctly modify the code for a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's submitting the data that I don't understand.
My code currently is:
import requests
fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'rU')
url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value':url_text}
r = requests.post(url, payload)
print r.text
This code comes back with the HTML of the website, but hasn't recognized the fact that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website is sending, usually you can get these with web debugging tools (like chrome/firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print r.text
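The body that requests builds from that payload matches the captured string exactly, which you can verify offline with urlencode (Python 3 shown here):

```python
from urllib.parse import urlencode

# Same two fields as above; dict order is preserved in Python 3.7+
payload = {'tab': 'Test by Direct Link', 'directInput': 'SOME_RANDOM_TEXT'}
print(urlencode(payload))  # -> tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
```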
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}
r = requests.post(post, data=data)
print(r.content)
I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) on a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug), and the left form should obviously be called using POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried (with a fake pnr in the example below, but I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, the print is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I could add any key/value pairs to the values dictionary and it doesn't produce any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect the request sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
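If you'd rather not pull in a parser just for this one field, a regular expression over the markup above also works. A minimal sketch:

```python
import re

# The hidden input as quoted above.
html = '<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">'

# Capture whatever sits in the value attribute of the _token input
match = re.search(r'name="_token"[^>]*value="([^"]+)"', html)
token = match.group(1)
print(token)  # -> WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m
```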
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser that would ease the form submission. You may also parse the HTML with an HTML parser, like BeautifulSoup yourself, extract the token and send via urllib2 or requests:
import requests
from bs4 import BeautifulSoup
PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"
with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]
    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })
    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
This might be a bit too direct a question; I'm new to Python.
I am trying to parse/scrape a video link from a video website (Putlocker), i.e. http://www.putlocker.com/file/A189D40E3E612C50.
The page initially comes up with this code below or similar:
<form method="post">
<input type="hidden" value="3d0865fbb040e670" name="hash">
<input name="confirm" type="submit" value="Continue as Free User"
disabled="disabled"
id="submitButton" class="confirm_button" style="width:190px;">
</form>
The value "3d0865fbb040e670" changes every time...
import urllib
import urllib2
url = 'http://www.putlocker.com/file/A189D40E3E612C50.'
response = urllib2.urlopen(url)
page = response.read()
From here I find the value of hash, then:
url = 'http://www.putlocker.com/file/A189D40E3E612C50.'
values = {'hash' : '3d0865fbb040e670'}
data = urllib.urlencode(values)
response = urllib2.urlopen(url)
page = response.read()
But I end up on the same page again. Do I need to post value="Continue as Free User" as well?
How do I go about posting both values?
A working code example would be helpful.
I am trying hard, but to no avail yet.
OK, after the suggestions made by a few programmers, I tried code like the below:
url = 'http://www.putlocker.com/file/A189D40E3E612C50'
response = urllib2.urlopen(url)
html = response.read()
r = re.search('value="([0-9a-f]+?)" name="hash"', html)
session_hash = r.group(1)
print session_hash
form_values = {}
form_values['hash'] = session_hash
form_values['confirm'] = 'Continue as Free User'
data = urllib.urlencode(form_values)
response = urllib2.urlopen(url, data=data)
html = response.read()
print html
So I am returned with the same page again and again. What am I doing wrong here? I have seen something called pycurl, but I want to use something simpler. Any clue?
urllib2.urlopen(url, data=data)
You do need to pass your encoded values parameter to the urlopen call:
response = urllib2.urlopen(url, data)
otherwise you will create another GET request instead of POSTing.
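The same rule carries over to Python 3's urllib.request: a Request only becomes a POST once data is supplied, which you can confirm without touching the network (the hash value is reused from the question):

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.putlocker.com/file/A189D40E3E612C50'
data = urlencode({'hash': '3d0865fbb040e670',
                  'confirm': 'Continue as Free User'}).encode()

get_req = Request(url)              # no data -> GET
post_req = Request(url, data=data)  # data supplied -> POST
print(get_req.get_method(), post_req.get_method())  # -> GET POST
```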