Download a binary file using Python requests module - python

I need to download a file from an external source. I am using Basic authentication to log in to the URL.
import requests
response = requests.get('<external url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print (html)
data = requests.get('<API URL to download the attachment>', auth=('<username>', '<password>'), stream=True)
print (data.content)
I am getting the output below:
<url to download the binary data>
\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\xcb\x00\x00\x1e\x00\x1e\x00\xbe\x07\x00\x00.\xcf\x05\x00\x00\x00'
I expect the URL to download the Word document within the same session.

Working solution
import requests
import shutil
response = requests.get('<url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print (html)
data = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    data.raw.decode_content = True
    shutil.copyfileobj(data.raw, f)
I am able to download the file as it is.

When you want to download a file directly you can use shutil.copyfileobj():
https://docs.python.org/2/library/shutil.html#shutil.copyfileobj
You are already passing stream=True to requests, which is what you need to get a file-like object back (response.raw). Just pass that as the source to copyfileobj().
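A minimal sketch of the copyfileobj() approach, assuming the response is requested with stream=True. The helper names save_raw_stream and download_file, and their arguments, are illustrative and not part of the original posts. save_raw_stream accepts any binary file-like object, so response.raw can be passed to it directly.

```python
import shutil


def save_raw_stream(raw, destination):
    """Copy a binary file-like object to `destination` and return the path."""
    with open(destination, "wb") as f:
        shutil.copyfileobj(raw, f)
    return destination


def download_file(url, destination, auth=None):
    """Stream `url` to disk. `url`, `destination` and `auth` are placeholders."""
    import requests  # imported lazily so save_raw_stream works without requests

    with requests.get(url, auth=auth, stream=True) as response:
        response.raise_for_status()
        # Ask urllib3 to undo any gzip/deflate Content-Encoding while reading.
        response.raw.decode_content = True
        return save_raw_stream(response.raw, destination)
```

Because the body is copied in blocks, the whole file never has to fit in memory, which matters more for large attachments than for a small Word document.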

Related

Get attached PDF file from HTTP request

I would like to download a file like this: https://www.bbs.unibo.it/conferma/?var=FormScaricaBrochure&brochureid=61305 with Python.
The problem is that this is not a direct link to the file; I only have the file ID in the query string.
I tried this code:
import requests
remote_url = "https://www.bbs.unibo.it/conferma/"
r = requests.get(remote_url, params = {"var":"FormScaricaBrochure", "brochureid": 61305})
But only the HTML is returned. How can I get the attached pdf?
This example shows how to download the file using only the brochureid:
import requests
url = "https://www.bbs.unibo.it/wp-content/themes/bbs/brochure-download.php?post_id={brochureid}&presentazione=true"
brochureid = 61305
with open("file.pdf", "wb") as f_out:
    f_out.write(requests.get(url.format(brochureid=brochureid)).content)
This downloads the PDF to file.pdf.

File corrupted when I try to download via requests.get()

I'm trying to automate the download of docs via Selenium.
I'm using requests.get() to download the file after extracting the url from the website:
import requests
import time
url = 'https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
myfile = requests.get(url)
with open('/Users/hemanthj/Downloads/AB Test/' + "A-Acc-USD" + '.pdf', 'wb') as f:
    f.write(myfile.content)
time.sleep(3)
The file is downloaded but is corrupted when I try to open it. The file size is only a few KB at most.
I tried adding the header info from this thread too but no luck:
Corrupted PDF file after requests.get() with Python
What within the headers makes the download work? Any solutions?
The problem was an incorrect URL.
It returned HTML instead of a PDF.
Looking through the site, I found the URL that you were looking for.
Try this code and then open the document with a PDF reader.
import requests
import pathlib
def load_pdf_from(url: str, filename: pathlib.Path) -> None:
    response: requests.Response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=1024):
                pdf_file.write(chunk)
    else:
        print(f"Failed to load pdf: {url}")
url:str = 'https://www.schroders.com/hkrprewrite/retail/en/attachment2.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
target_filename:pathlib.Path = pathlib.Path.cwd().joinpath('loaded_pdf.pdf')
load_pdf_from(url, target_filename)
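One way to catch the "HTML instead of PDF" case before opening the file is to look at the first bytes of the download. This is only a sketch: the function names are hypothetical, and the HTML check is a rough heuristic rather than a full content sniffer.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if `data` starts with the PDF file signature."""
    # PDF files begin with the ASCII header "%PDF-" followed by a version.
    return data.startswith(b"%PDF-")


def describe_download(data: bytes) -> str:
    """Give a short guess at what was downloaded: 'pdf', 'html' or 'unknown'."""
    if looks_like_pdf(data):
        return "pdf"
    # HTML error or login pages usually start with a tag after optional whitespace.
    if data.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

With requests, this could be called as describe_download(response.content[:1024]); a result of "html" suggests the URL points at a page rather than at the document itself.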

How to retrieve a file from github through python program

I am trying to retrieve an XML file from GitHub using a Python program. I have tried different URLs, but for each of them it says the content was not found. What are the different options for retrieving a file from GitHub? Any help would be much appreciated!
import requests
from github import Github
import base64
g = Github("access_token")
#url = 'https://api.github.com/repos/{username}/{repos_name}/contents/{path}/{filename}.xml'
#url = 'https://git.<company domain>.com/raw/IT/{repos_name}/{path}/{filename}.xml?token<value>'
url = 'https://git.<company domain>.com/raw/IT/{repos_name}/{path}/{filename}.xml?token=<value>'
req = requests.get(url)
#print ('Keep Going!', req.content)
if req.status_code == requests.codes.ok:
    req = req.json()  # the response is JSON
    # req is now a dict with keys: name, encoding, url, size ...
    # and content. But it is encoded with base64.
    content = base64.b64decode(req['content'])
else:
    print('Content was not found.')
output:
Keep Going! b'{"message":"Not Found","documentation_url":"https://developer.github.com/v3/repos/contents/#get-contents"}'
Content was not found.
Replace all the <> and {} placeholders in your URL with the actual path to the file you're trying to retrieve.
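Once the URL is correct, the decoding step from the question could look like the sketch below. It assumes the response is contents-API style JSON with encoding and content keys, as the question's comments describe; a raw URL returns the file body directly and needs no decoding. The function name is hypothetical. Note that base64.decodestring was removed in Python 3.9, so the sketch uses base64.b64decode.

```python
import base64


def decode_contents_payload(payload: dict) -> bytes:
    """Decode the file body from a contents-API style JSON payload."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode skips the newlines that appear inside the content field.
    return base64.b64decode(payload["content"])
```

With requests, this would be called as decode_contents_payload(req.json()) after checking that the status code is 200.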

Download an SSRS report in Python using requests

I am trying to use requests to download an SSRS report. The following code will download an empty Excel file:
url = 'http://MY REPORT URL HERE/ReportServer?/REPORT NAME HERE&rs:Format=EXCELOPENXML'
s = requests.Session()
s.post(url, data={'_username': 'username', '_password': 'password'})
r = s.get(url)
output_file = r'C:\Saved Reports\File.xlsx'
with open(output_file, 'wb') as downloaded_file:
    for chunk in r.iter_content(100000):
        downloaded_file.write(chunk)
I have successfully used requests_ntlm to complete this task, but I am wondering why the above code is not working as intended. The Excel file turns out to be empty; I feel it is due to an issue with logging in and passing those cookies to the GET request.
I was able to get this to work, but for PDFs. I found the solution here
Here's a snippet of my code:
import requests
from requests_ntlm import HttpNtlmAuth
session = requests.Session()
session.auth = HttpNtlmAuth(domain+uid,pwd)
response = session.get(reporturl,stream=True)
print(response.status_code)
with open(outputlocation+mdcProp+'.pdf','wb') as pdf:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            pdf.write(chunk)
session.close()
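Since the symptom in the question is an empty file, it can help to count the bytes as they are written. The helper below is a sketch with a hypothetical name; it takes any iterable of byte chunks, such as response.iter_content(chunk_size=1024).

```python
def write_chunks(chunks, destination):
    """Write an iterable of byte chunks to `destination`; return bytes written."""
    total = 0
    with open(destination, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total
```

A return value of 0 means the server sent no body, which points at the request (status code, authentication) rather than at the file-writing code.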

How to download a file over HTTPS with Python requests

I am trying to download a file over HTTPS using Python requests, and I wrote some sample code for this. When I run my code, it does not download the PDF file given in the link. Instead, it downloads the HTML of the login page. I checked the response status code and it is 200. Login is required to download the file. How can I download it?
My code:
import requests
import json
# Original File url = "https://seller.flipkart.com/order_management/manifest.pdf?sellerId=8k5wk7b2qk83iff7"
url = "https://seller.flipkart.com/order_management/manifest.pdf"
uname = "xxx#gmail.com"
pwd = "xxx"
pl1 = {'sellerId':'8k5wk7b2qk83i'}
payload = {uname:pwd}
ses = requests.Session()
res = ses.post(url, data=json.dumps(payload))
resp = ses.get(url, params = pl1)
print(resp.status_code)
print(resp.content)
I tried several solutions, including sending a POST request with my login credentials using a requests session object and then downloading the file with the same session object, but it didn't work.
EDIT:
It is still returning the HTML for the login page.
Have you tried passing the auth param to the GET? Something like this:
resp = requests.get(url, params=pl1, auth=(uname, pwd))
You can then write resp.content to a local file, myfile.pdf:
fd = open('myfile.pdf', 'wb')
fd.write(resp.content)
fd.close()
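For context, auth=(uname, pwd) makes requests send HTTP Basic credentials in an Authorization header. That only helps if the server accepts Basic authentication; a site that logs in through an HTML form will typically ignore the header and keep returning the login page. The sketch below shows roughly what such a header contains. The function name is hypothetical, the UTF-8 encoding is an assumption, and requests builds the header itself, so this is for illustration only.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic authentication."""
    # Credentials are joined with a colon and base64-encoded, not encrypted.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```

For example, basic_auth_header("user", "pass") returns "Basic dXNlcjpwYXNz". Because the value is only encoded, it should be sent over HTTPS.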
