I'm looking for a way to get files such as the one in this link, which can be downloaded by clicking a "download" button. I couldn't find a way despite reading many posts that seemed to be relevant.
The code I have so far:
import requests
from bs4 import BeautifulSoup as bs

with open('ukb49810.html', 'r') as f:
    html = f.read()

index_page = bs(html, 'html.parser')

for i in index_page.find_all('a', href=True)[2:]:
    if 'coding' in i['href']:
        file = requests.get(i['href']).text
        download_page = bs(file, 'html.parser').find_all('a', href=True)
From the download_page variable I got "URLs" with the code
for ii in download_page:
    print(ii['href'])
which printed
http://
index.cgi
browse.cgi?id=9&cd=data_coding
search.cgi
catalogs.cgi
download.cgi
https://bbams.ndph.ox.ac.uk/ams/resApplications
help.cgi?cd=data_coding
field.cgi?id=22001
field.cgi?id=22001
label.cgi?id=100313
field.cgi?id=31
field.cgi?id=31
label.cgi?id=100094
I tried to use these supposed URLs to compose the download URL, but the link I got didn't work.
Thanks.
None of these links are to the download page. If you view source on the page, you will see how the download is done:
<form method="post" action="codown.cgi">
<input type="hidden" name="id" value="9"></td><td>
<input class="btn_glow" type="submit" value="Download">
</form>
So you would need to submit a POST request to codown.cgi with your value, something like:
curl --request POST \
--url https://biobank.ndph.ox.ac.uk/showcase/codown.cgi \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data id=9
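The same request in Python with requests would look roughly like this (a minimal sketch; the output filename is just an assumption about what to call the saved file):
import requests

# id=9 matches the hidden input in the form above
response = requests.post('https://biobank.ndph.ox.ac.uk/showcase/codown.cgi', data={'id': 9})
response.raise_for_status()

# Save whatever comes back; the filename here is arbitrary
with open('coding9.tsv', 'wb') as f:
    f.write(response.content)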
However, what I would suggest is searching the site for a more convenient option than scraping. On something like this one is likely to be available (and indeed, it is in this case!)
It looks like all of the data you can get from that page (and its variants) can be obtained from the Downloads->Schema page, and those all offer simple download links you can use, e.g.:
https://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=5
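Fetching one of those is then a plain GET (a sketch; the local filename is just illustrative):
import requests

# schema.cgi?id=5 is one of the plain download links on the Schema page
response = requests.get('https://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=5')
response.raise_for_status()

with open('schema_5.txt', 'wb') as f:
    f.write(response.content)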
Related
I've looked at quite a few suggestions for clicking on buttons on web pages using python, but don't fully understand what the examples are doing and can't get them to work (particularly the ordering and combination of values).
I'm trying to download a PDF from a web site.
The first time you click on the PDF to download it, it takes you to a page where you have to click on "Agree and Proceed". Once you do that the browser stores the cookie (so you never need to agree again) and then opens the PDF in the browser (which is what I want to download).
Here is the link to the accept page: https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753
I've used Chrome Developer to get this:
<form name="showAnnouncementPDFForm" method="post" action="announcementTerms.do">
<input value="Decline" onclick="window.close();return false;" type="submit">
<input value="Agree and proceed" type="submit">
<input name="pdfURL" value="/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf" type="hidden">
</form>
and this is the final page you get to: https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf
I then tried to use it like this:
import requests
values = {}
values['showAnnouncementRDFForm'] = 'announcementTerms.do'
values['pdfURL'] = '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'
req = requests.post('https://asx.com.au/', data=values)
print(req.text)
I tried a variety of URLs and changed what values I provide, but I don't think it's working correctly.
The print at the end gives me what looks like the HTML of a web page. I'm not sure exactly what it is, as I'm doing this from the command line of a server I'm SSH'd into (a Pi), but I'm confident it's not the PDF I'm after.
As a final solution, I'd like the Python code to take the PDF link, automatically Agree and Proceed, store the cookie to avoid future approvals, then download the PDF.
Hope that made sense and thanks for taking the time to read my question.
Markus
If you want to download the file directly and you know the URL you can access it without using a cookie:
import requests
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf")
with open('./test1.pdf', 'wb') as f:
    f.write(response.content)
If you don't know the URL you can read it from the form then access it directly without a cookie:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
response = requests.get(f'{base_url}{pdf_url}')
with open('./test2.pdf', 'wb') as f:
    f.write(response.content)
If you want to set the cookie:
import requests
cookies = {'companntc': 'tc'}
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf", cookies=cookies)
with open('./test3.pdf', 'wb') as f:
    f.write(response.content)
If you really want to use POST:
import requests
payload = {'pdfURL': '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'}
response = requests.post('https://www.asx.com.au/asx/statistics/announcementTerms.do', params=payload)
with open('./test4.pdf', 'wb') as f:
    f.write(response.content)
Or read the pdfURL from the form and do a POST:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
payload = {'pdfURL': pdf_url}
response = requests.post(f"{base_url}/asx/statistics/announcementTerms.do", params=payload)
with open('./test5.pdf', 'wb') as f:
    f.write(response.content)
I have a situation where I'm out of ideas on how exactly to proceed. I have a very repetitive task to do, which consists of:
Choose file from list of files
Press submit
Repeat until all files in the folder have been submitted/uploaded
Sometimes I have hundreds of files at a time, which can be very time-consuming. I would like to write a script to automate this routine.
This is the visual of the page in question (screenshot: Menu Format).
Of course this is represented by the following HTML code:
<input type="file" class="inputFile" data-name="userNumListFile">
<form class="navbar-form navbar-left" method="post" action="/give/giveItemBatch" enctype="multipart/form-data"><button type="submit" class="btn btn-default">Submit</button></form>
Those are the two elements I need to send an HTTP request to. I have done something similar in Python, where I used the following code to access an authorization-only webpage and then used bs4 to gather the info I needed.
import requests
from bs4 import BeautifulSoup

payload = {'username': 'user',
           'password': 'pw',
           'rememberMe': 'true'}

with requests.Session() as s:
    url = "http://yada.com"
    p = s.post(url, data=payload)
    soup = BeautifulSoup(p.text, "html.parser")
I was wondering if there is something similar to the above where I can submit a file to be uploaded and then press the submit button.
I would then cycle through all the files on my folder, that's the easy part.
Just use requests.post inside a loop, pointed at the remote target. First read the local file names into a list, then loop over them and pass each file to requests.post against the remote endpoint, as sketched below.
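A minimal sketch of that idea, assuming the form posts to /give/giveItemBatch on your host and that the file field is named userNumListFile (the data-name from your snippet; check the real name attribute in the page source):
import os
import requests

base_url = 'http://yada.com'   # assumption: replace with the real host
folder = '/path/to/files'      # local folder containing the files to upload

with requests.Session() as s:
    # log in first, as in your existing snippet
    s.post(base_url, data={'username': 'user', 'password': 'pw', 'rememberMe': 'true'})

    for filename in os.listdir(folder):
        path = os.path.join(folder, filename)
        with open(path, 'rb') as f:
            # requests builds the multipart/form-data body for us;
            # the field name 'userNumListFile' is an assumption taken from data-name
            r = s.post(f'{base_url}/give/giveItemBatch', files={'userNumListFile': f})
        print(filename, r.status_code)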
I'm having trouble understanding the requests module. I understand that HTTP has POST, GET, PUT and DELETE methods, but I think I need to know more about how requests works. I have read the documentation but still have a lot of questions about how to do things; this is the first time I've tried to make a script for the web without Selenium or Mechanize.
I'm trying to interact with vubey.yt, but I can't make the vubey URL change to what I want (or to what I see when I use the page manually). I can send my data, and it changes the URL, but if I copy that URL and navigate to it manually, it does nothing... so I don't understand what's happening, because I don't have any visual clue.
Here is my code (Python 3.5):
import requests

def Descarga(youtubeid):
    # I have also tried sending only videoURL, without quality and submit, but the result is the same
    r = requests.get('https://vubey.yt/', params={'videoURL': youtubeid, 'quality': '320', 'submit': 'Convert+To+MP3'})
    print(r.url, r.status_code)

Descarga("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
If someone could link me to a tutorial for really understanding how to use this module, or tell me what I'm doing wrong or misunderstanding about it, I'd be very thankful.
Let's look at the site's code together:
<form class="w-clearfix" name="wf-form-signup-form" data-name="conversionForm" form action="/" method="post" id="conversionForm">
It's a form. The form uses the 'post' method and submits to the same page.
<input class="w-input field" id="videoURL" type="text" placeholder="Video URL" name="videoURL" data-name="videoURL" required="required">
The first field is "videoURL".
<select class="w-select" id="quality" name="quality" data-name="quality" required="required">
The second field is "quality".
<input class="w-button button" type="submit" name="submit" value="Convert To MP3">
</form>
The submit button is not important. Ignore it.
Now, let's pythonify.
import requests
video_url = 'https://www.youtube.com/watch?v=C0DPdy98e4c'
quality = '320'
post_data={ 'videoURL': video_url, 'quality': quality }
response = requests.post('https://vubey.yt/', data=post_data)
print(response.url, response.status_code)
Now you can parse the response.content and search for "Please wait" until the conversion is completed.
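As a rough sketch of that polling step (assuming the page keeps showing "Please wait" until the conversion is ready):
import time
import requests

post_data = {'videoURL': 'https://www.youtube.com/watch?v=C0DPdy98e4c', 'quality': '320'}
response = requests.post('https://vubey.yt/', data=post_data)

# Keep re-submitting and checking until the "Please wait" message disappears
while 'Please wait' in response.text:
    time.sleep(5)
    response = requests.post('https://vubey.yt/', data=post_data)

# At this point response.text should contain the finished conversion page
print(response.status_code)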
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But print page.text shows that the site is trying to redirect me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I'm not sure how to log in to a JavaScript page. Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
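A sketch of what that could look like, assuming the form at signIn.jsp posts your USERNAME/PASSWORD fields to ValidatePassword.jsp at the site root:
import requests

base_url = 'http://www.whispernumber.com'
acc_pwd = {'USERNAME': 'myusername', 'PASSWORD': 'mypassword'}

with requests.Session() as s:
    # POST the credentials to the form's action, not to signIn.jsp itself;
    # the Session keeps any cookies the site sets on a successful login
    s.post(f'{base_url}/ValidatePassword.jsp', data=acc_pwd)

    page = s.get(f'{base_url}/calendar.jsp?day=20150129')
    print(page.text)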
Requests isn't a web browser; it is an HTTP client, and it simply grabs the raw text of the page. You are going to want to use something like Selenium or another headless browser to programmatically log in to a site.
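If you go that route, a minimal Selenium sketch might look like this (the field names are taken from the question's snippet; the submit-button selector is a guess you would need to confirm against the page):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or Firefox(), with the matching driver installed
driver.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')

# Fill the login form; USERNAME/PASSWORD match the field names used in the question
driver.find_element(By.NAME, 'USERNAME').send_keys('myusername')
driver.find_element(By.NAME, 'PASSWORD').send_keys('mypassword')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()

# The browser now holds the session cookie, so protected pages load normally
driver.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(driver.page_source)
driver.quit()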
I'm using Python 3.3 and the Requests library to do a basic POST request.
I want to simulate what happens if you manually enter information into the browser from the webpage:
https://www.dspayments.com/FAIRFAX. For example, at that URL, enter "x" for the license plate and Virginia as the state. Then the URL changes to https://www.dspayments.com/FAIRFAX/Home/PayOption, and it displays the desired information (I care about the source code of this second page).
I looked through the source code of the above two URLs. Doing "inspect element" on the text boxes of the first URL, I found some things that need to be included in the POST request: {'Plate': "x", 'PlateStateProv': "VA", "submit": "Search"}.
Then the second website (ending in /PayOption), had the raw html:
<form action="/FAIRFAX/Home/PayOption" method="post"><input name="__RequestVerificationToken" type="hidden" value="6OBKbiFcSa6tCqU8k75uf00m_byjxANUbacPXgK2evexESNDz_1cwkUpVVePA2czBLYgKvdEK-Oqk4WuyREi9advmDAEkcC2JvfG2VaVBWkvF3O48k74RXqx7IzwWqSB5PzIJ83P7C5EpTE1CwuWM9MGR2mTVMWyFfpzLnDfFpM1" /><div class="validation-summary-valid" data-valmsg-summary="true">
I then used the name:value pairs from the above HTML as keys and values in the payload dictionary of my POST request. I think the problem is that on the second URL there is the "__RequestVerificationToken", which seems to have a randomly generated value every time.
How can I properly POST to this website? A "correct" answer would be one that produces the same source code on the website ending in "/PayOption" as if you manually enter "x" as the plate number and Virginia as the state and click submit on the first url.
My code is:
import requests
url1 = r'https://www.dspayments.com/FAIRFAX'
url2 = r'https://www.dspayments.com/FAIRFAX/Home/PayOption'
s = requests.Session()
# user_agent was not defined in the original snippet; any ordinary headers dict works
user_agent = {'User-Agent': 'Mozilla/5.0'}

# GET request
r = s.get(url1)
text1 = r.text
startstr = '<input name="__RequestVerificationToken" type="hidden" value="'
start_ind = text1.find(startstr)+len(startstr)
end_ind = text1.find('"',start_ind)
auth_string = text1[start_ind:end_ind]
# POST request
payload = {'Plate': 'x', 'PlateStateProv': 'VA', 'submit': 'Search',
           '__RequestVerificationToken': auth_string, 'validation-summary-valid': 'true'}
post = s.post(url2, headers=user_agent, data=payload)
source_code = post.text
Thanks, -K.
You should only need the data from the first page, and as you say, the __RequestVerificationToken changes with each request.
You'll have to do something like:
GET request to https://www.dspayments.com/FAIRFAX
harvest __RequestVerificationToken value (Requests Session will take care of any associated cookies)
POST using the data you scraped from the GET request
extract whatever you need from the 2nd page
So, just focus on creating a form submission that's exactly like the one on the first page. Have a stab at it, and if you're still struggling I can help dig into the particulars.
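Roughly, that flow could look like the sketch below (BeautifulSoup is used for the token extraction instead of string slicing; the field names come from your own inspection):
import requests
from bs4 import BeautifulSoup

url1 = 'https://www.dspayments.com/FAIRFAX'
url2 = 'https://www.dspayments.com/FAIRFAX/Home/PayOption'

with requests.Session() as s:
    # GET the first page and harvest the per-request token (cookies are kept by the Session)
    r = s.get(url1)
    soup = BeautifulSoup(r.text, 'html.parser')
    token = soup.find('input', {'name': '__RequestVerificationToken'})['value']

    # POST the same fields a manual search would submit
    payload = {
        'Plate': 'x',
        'PlateStateProv': 'VA',
        'submit': 'Search',
        '__RequestVerificationToken': token,
    }
    post = s.post(url2, data=payload)

    # extract whatever you need from the result page
    print(post.status_code)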