What I want to accomplish is to download an .xlsx file from a link like the one below:
https://......./something.do?parameter=[parameter_value]
Please note that there is no point in showing the exact link, since it is internal.
The problem is that the download starts automatically when I open the link in a browser, but when I try to do it programmatically I cannot get the exact link to the file.
I figured out that the Content-Disposition header of the HTTP response contains the file name, like this:
Content-Disposition: attachment; filename="ABCD.xlsx"
But so far I have only managed to capture the HTML code of the site, not the file itself.
Currently my Python code looks like this:
import requests
urlBase = 'link to the authentication page'
urlFile = 'https://......./something.do?parameter=[parameter_value]'  # like the link above
user = 'username'
pw = 'password'
session = requests.Session()
session.auth = (user, pw)
auth = session.post(urlBase)
response = session.get(urlFile)
The response currently contains the HTML code of the page.
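What I expect to need, once the session is authenticated correctly, is something like the sketch below; the header parsing is my own assumption based on the Content-Disposition value shown above:

import re

# Hypothetical continuation: pull the file name from the
# Content-Disposition header and save the binary content.
cd = response.headers.get('Content-Disposition', '')
match = re.search(r'filename="([^"]+)"', cd)
if match:
    with open(match.group(1), 'wb') as f:
        f.write(response.content)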
Thanks in advance
I am attempting to download a zip file from a website that sits behind an https:// link. I have tried the following but can't seem to get any output. Could anyone suggest what I might be doing wrong?
URL: www.somewebsite.com
Zip file download link: www.somewebsite.com/output/revisionId=40687821$$Xiiy75&action_id=
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
To download a file from a non-protected URL, do something like this:
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
with open("result.zip", "wb") as fout:
fout.write(resp.content)
Of course you should check whether you got a valid response before writing the zip file.
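A minimal sketch of such a check, assuming the same URL and credentials as above; raise_for_status() aborts with an exception on any 4xx/5xx response:

import requests

url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
with open("result.zip", "wb") as fout:
    fout.write(resp.content)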
For a considerable number of websites with a login, the following recipe will work. However, if the site uses too much JavaScript, this might not necessarily work.
Use a requests session in order to store any session cookies, and perform the following three steps:
1. GET the login URL. This will pick up any session cookies or CSRF-protection cookies.
2. POST to the login URL with the username and password. The names of the form fields to post depend on the page; use your web browser in debug mode to learn the right values, which can be more parameters than just the username and password.
3. GET the document URL and save the result to a file.
On Firefox, for example, you go to the website you want to log in to, press F12 (for debug mode), click on the Network tab, and then reload.
Fill in the login form, submit it, and look in the debug panel for a POST request.
The generic Python code would look like this:
import requests

def login_and_download():
    ses = requests.session()
    # Step 1: get the login page.
    rslt = ses.get("https://www.asite.com/login-home")
    # Now any potentially required cookie will be set.
    if rslt.status_code != 200:
        print("failed getting login page")
        return False
    # For simple pages you can proceed to log in directly.
    # For slightly more complicated pages you might have to parse the HTML
    # (e.g. to extract a CSRF token).
    # For really annoying pages that use loads of javascript it might be
    # even more complicated.
    # Step 2: perform a POST request to log in.
    # The URL and the form fields depend on the site you want to connect to;
    # you have to analyze its login procedure.
    login_post_url = "https://www.asite.com/login"  # placeholder
    rslt = ses.post(login_post_url, data={"username": "bob", "password": "secret"})
    if rslt.status_code != 200:
        print("failed logging in")
        return False
    # Step 3: download the URL that you want to get.
    url_of_your_document = "https://www.asite.com/document.zip"  # placeholder
    rslt = ses.get(url_of_your_document)
    if rslt.status_code != 200:
        print("failed fetching the file")
        return False
    with open("result.zip", "wb") as fout:
        fout.write(rslt.content)
    return True
I'm working with the website 'musescore.com', which has many files in the '.mxl' format that I need to download automatically with Python.
Each file on the website has a unique ID number. Here's a link to an example file:
https://musescore.com/user/43726/scores/76643
The last number in the URL is the ID number for this file. I have no idea where on the website the mxl file for a score is located, but I know that to download the file one must visit this URL:
https://musescore.com/score/76643/download/mxl
This link is the same for every file, but with that file's particular ID number in it. As I understand it, this URL executes code that downloads the file, and is not an actual path to the file.
Here's my code:
import requests
url = 'https://musescore.com/score/76643/download/mxl'
user = 'myusername'
password = 'mypassword'
r = requests.get(url, auth=(user, password), stream=True)
with open('file.mxl', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
This code downloads a webpage saying I need to sign in to download the file; it is supposed to download the mxl file for this score. This must mean I am improperly authenticating with the website. How can I fix this?
By passing an auth parameter to get, you're attempting to use HTTP Basic Authentication, which is not what this particular site uses. You'll need to use an instance of requests.Session to post to their login endpoint and maintain the cookie(s) that result from that process.
Additionally, this site uses a CSRF token that you must first extract from the login page in order to include it with your post to the login endpoint.
Here is a working example, obviously you will need to change the username and password to your own:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('https://musescore.com/user/login')
soup = BeautifulSoup(r.content, 'html.parser')
csrf = soup.find('input', {'name': '_csrf'})['value']
s.post('https://musescore.com/user/auth/login/process', data={
    'username': 'herp@derp.biz',
    'password': 'secret',
    '_csrf': csrf,
    'op': 'Log in'
})
r = s.get('https://musescore.com/score/76643/download/mxl')
print(f"Status: {r.status_code}")
print(f"Content-Type: {r.headers['content-type']}")
Result, with content type showing it is successfully downloading the file:
Status: 200
Content-Type: application/vnd.recordare.musicxml
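From there, writing the response body to disk completes the download; the file name below is my own choice, not something the site dictates:

with open('score.mxl', 'wb') as f:
    f.write(r.content)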
* Updated to clarify information from responses *
There is a website my IT organization set up that allows us to submit a set of parameters to a web form and click a "submit" button; it then generates a .txt file of users provisioned to specified applications, which (at least with my current Chrome settings) is automatically sent to the download folder.
In order to automate this process and get an updated list of users each week, I've been trying to write a Python script that uses urllib (+ urllib2, requests, etc.) to submit the form and then grab the .txt file that is downloaded.
When I try running the code below...
import urllib, urllib2
url = 'my url'
values = {'param1': 'response1',
          'param2': 'response2',
          'param3': 'response3'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
data = response.read()
...it doesn't throw any errors, but I don't get any response either. I've checked all the likely paths that the file would download to and can't find anything.
And if I add something like...
with open('response.txt', 'w') as f:
    f.write(data)
...then it just writes the source HTML for the page to the file; it doesn't actually grab the file generated by the query I'm essentially posting through the form.
Any help here would be greatly appreciated!
You haven't saved the response to a file.
with open('response.txt', 'w') as f:
    f.write(data)
That will save a file called response.txt in the directory you ran the script from. If you just want to check the contents of the response, you can use:
print(data)
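If print(data) shows the page HTML rather than the generated file, one way to tell programmatically is to look at the Content-Disposition header; this is a sketch in the question's Python 2 style, and it assumes the server marks the generated file as an attachment:

import urllib, urllib2

req = urllib2.Request(url, urllib.urlencode(values))
response = urllib2.urlopen(req)
# Servers that send a file for download usually set this header.
disposition = response.info().getheader('Content-Disposition')
if disposition and 'attachment' in disposition:
    with open('users.txt', 'wb') as f:
        f.write(response.read())
else:
    print('Got HTML back, not the generated file')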
I am trying to download a file over HTTPS using Python requests. I wrote some sample code for this, but when I run it, it does not download the PDF file given in the link; instead it downloads the HTML code of the login page. I checked the response status code and it is 200. Logging in is necessary to download the file. How can I download the file?
My code:
import requests
import json
# Original File url = "https://seller.flipkart.com/order_management/manifest.pdf?sellerId=8k5wk7b2qk83iff7"
url = "https://seller.flipkart.com/order_management/manifest.pdf"
uname = "xxx#gmail.com"
pwd = "xxx"
pl1 = {'sellerId': '8k5wk7b2qk83i'}
payload = {uname: pwd}  # note: this uses the username itself as the dict key, not a form field name
ses = requests.Session()
res = ses.post(url, data=json.dumps(payload))
resp = ses.get(url, params = pl1)
print resp.status_code
print resp.content
I tried several solutions, including sending a POST request with my login credentials using requests' Session object and then downloading the file with the same session object, but it didn't work.
EDIT:
It still returns the HTML of the login page.
Have you tried passing the auth param to the GET? Something like this:
resp = requests.get(url, params=pl1, auth=(uname, pwd))
And you can write resp.content to a local file myfile.pdf
with open('myfile.pdf', 'wb') as fd:
    fd.write(resp.content)
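Because this site returns the login page with status 200 when authentication fails, it is worth checking the Content-Type before writing the file; the exact header value here is an assumption:

if resp.headers.get('content-type', '').startswith('application/pdf'):
    with open('myfile.pdf', 'wb') as fd:
        fd.write(resp.content)
else:
    print('Got HTML back; authentication probably failed')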
I have a web form that I want to submit using Python. I know from looking at the source code that to send a file to the site I need to use 'FILE'. However, when I run the following code against that site:
url = "http://mascot.proteomics.dundee.ac.uk/cgi/search_form.pl?FORMVER=2&SEARCH=MIS"
values = {'FILE' : '/homes/ndeklein/test.mzML'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
The page contains the following:
<HTML><HEAD><TITLE>Fatal Error</TITLE></HEAD>
<BODY>
<H1>Fatal Error</H1>
<P><B>must specify search type</B><P></BODY></HTML>
So I must specify the search type. However, I have no idea what name that field has in the web form. If I had a list of everything being sent when submitting by hand, I could probably figure it out. So how can I find out what name the POST uses for the search type, or how can I get a list of everything sent by the web form?
I assume you have access to the form in a browser. When your browser submits the form, you can see what is submitted using the browser's developer tools (e.g. the Firebug add-on for Firefox).
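Once the field names are visible in the network panel, a multipart upload with requests might look like the sketch below; FORMVER and SEARCH are taken from the question's URL query string, and the rest is an assumption about the form:

import requests

url = "http://mascot.proteomics.dundee.ac.uk/cgi/search_form.pl"
data = {"FORMVER": "2", "SEARCH": "MIS"}  # values from the question's URL
with open("/homes/ndeklein/test.mzML", "rb") as f:
    resp = requests.post(url, data=data, files={"FILE": f})
print(resp.text)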