Get attached PDF file from HTTP request - python

I would like to download a file like this: https://www.bbs.unibo.it/conferma/?var=FormScaricaBrochure&brochureid=61305 with Python.
The problem is that it is not a direct link to the file; I only get the file id in the query string.
I tried this code:
import requests
remote_url = "https://www.bbs.unibo.it/conferma/"
r = requests.get(remote_url, params={"var": "FormScaricaBrochure", "brochureid": 61305})
But only the HTML is returned. How can I get the attached PDF?

You can use this example to download the file using only the brochureid:
import requests

url = "https://www.bbs.unibo.it/wp-content/themes/bbs/brochure-download.php?post_id={brochureid}&presentazione=true"
brochureid = 61305

# Request the direct download endpoint and write the response body to disk
with open("file.pdf", "wb") as f_out:
    f_out.write(requests.get(url.format(brochureid=brochureid)).content)
This downloads the PDF to file.pdf.
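For large brochures you may prefer to stream the response and fail loudly on HTTP errors. A minimal sketch of that variant, assuming the same endpoint as above:
import requests

url = "https://www.bbs.unibo.it/wp-content/themes/bbs/brochure-download.php?post_id={brochureid}&presentazione=true"
brochureid = 61305

# Stream the download so the whole file is never held in memory at once
with requests.get(url.format(brochureid=brochureid), stream=True) as r:
    r.raise_for_status()  # raise on 4xx/5xx instead of silently saving an error page
    with open("file.pdf", "wb") as f_out:
        for chunk in r.iter_content(chunk_size=8192):
            f_out.write(chunk)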

Related

Python - download a file from URL

I would like to know how to download a file from a specific URL in Python without knowing the file type or name, just like downloading it by opening it in a browser.
URL example:
https://sourceforge.net/projects/portableapps/files/PortableApps.com%20Platform/PortableApps.com_Platform_Setup_19.0.paf.exe/download?use_mirror=deac-fra&use_mirror=deac-fra&r=
Try this:
import requests

URL = "http://www.example.com"
with open("Filename", "wb") as f:
    f.write(requests.get(URL).content)
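That hard-codes the filename, though. If you also need the real name, a common approach (a sketch, not part of the original answer) is to read it from the Content-Disposition header and fall back to the last path segment of the final URL:
import posixpath
import re
import urllib.parse

import requests

def download(url):
    # Follow redirects so r.url is the final location of the file
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()
    # Prefer the filename the server advertises in Content-Disposition
    m = re.search(r'filename="?([^";]+)"?', r.headers.get("Content-Disposition", ""))
    if m:
        filename = m.group(1)
    else:
        # Fall back to the last path segment of the final URL
        filename = posixpath.basename(urllib.parse.urlsplit(r.url).path) or "download.bin"
    with open(filename, "wb") as f:
        f.write(r.content)
    return filename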

File corrupted when I try to download via requests.get()

I'm trying to automate the download of docs via Selenium.
I'm using requests.get() to download the file after extracting the URL from the website:
import time

import requests

url = 'https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
myfile = requests.get(url)
open('/Users/hemanthj/Downloads/AB Test/' + "A-Acc-USD" + '.pdf', 'wb').write(myfile.content)
time.sleep(3)
The file is downloaded but is corrupted when I try to open. The file size is only a few KB at most.
I tried adding the header info from this thread too but no luck:
Corrupted PDF file after requests.get() with Python
What within the headers makes the download work? Any solutions?
The problem was an incorrect URL: it returned HTML instead of a PDF. Looking through the site, I found the URL you were looking for.
Try this code and then open the document with a PDF reader.
import pathlib

import requests

def load_pdf_from(url: str, filename: pathlib.Path) -> None:
    # Stream the response so the PDF is written to disk chunk by chunk
    response: requests.Response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=1024):
                pdf_file.write(chunk)
    else:
        print(f"Failed to load pdf: {url}")

url: str = 'https://www.schroders.com/hkrprewrite/retail/en/attachment2.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
target_filename: pathlib.Path = pathlib.Path.cwd().joinpath('loaded_pdf.pdf')
load_pdf_from(url, target_filename)
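A quick way to catch this failure mode early (a sketch, not from the original answer): check the Content-Type header, or the first bytes of the body, before writing the file. Real PDFs start with the magic bytes %PDF.
import requests

response = requests.get(url, stream=True)

# An HTML error or login page typically reports text/html here
content_type = response.headers.get("Content-Type", "")
if "pdf" not in content_type.lower():
    print(f"Expected a PDF but got {content_type!r} from {url}")

# Alternatively, peek at the magic bytes before saving anything
first_chunk = next(response.iter_content(chunk_size=4))
if not first_chunk.startswith(b"%PDF"):
    print("Response body does not look like a PDF")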

urllib.request.urlretrieve returns corrupt file (How to handle this kind of url?)

I want to download about 1000 PDF files from a web page.
Then I encountered this awkward PDF URL format.
Neither requests.get() nor urllib.request.urlretrieve() works for me.
A usual PDF URL looks like:
https://webpage.com/this_file.pdf
But this URL is like:
https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01
So it doesn't have .pdf in the URL, and if you click on it in a browser you can download the file, but using Python's urllib you get a corrupt file.
At first I thought it was being redirected to some other URL, so I tried the requests.get(url, allow_redirects=True) option,
but the result was the same URL as before.
import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'
urllib.request.urlretrieve(url, filename)
this code downloads corrupt pdf file.
I solved it using the content field of the response object.
import requests

filename = 'pdf1.pdf'
url = . . .
response = requests.get(url)
with open('./novels/' + filename, 'wb') as f:
    f.write(response.content)
Referred to this Q&A: Download and save PDF file with Python requests module
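Since the goal is roughly 1000 files, it may also be worth reusing one connection across downloads. A minimal sketch against the same endpoint (the ID list and output directory here are made up for illustration):
import pathlib

import requests

base_url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do'
out_dir = pathlib.Path('./novels')
out_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical list of document IDs to fetch
wrt_ids = [9000001, 9031938]

# A Session reuses the underlying TCP connection across requests
with requests.Session() as session:
    for wrt_id in wrt_ids:
        params = {'wrtSn': wrt_id, 'fileSn': 1, 'wrtFileTy': '01'}
        response = session.get(base_url, params=params)
        response.raise_for_status()
        (out_dir / f'{wrt_id}.pdf').write_bytes(response.content)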

Download a binary file using Python requests module

I need to download a file from an external source, and I am using Basic authentication to log in to the URL:
import requests
response = requests.get('<external url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)

data = requests.get('<API URL to download the attachment>', auth=('<username>', '<password>'), stream=True)
print(data.content)
I am getting the output below:
<url to download the binary data>
b'\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\xcb\x00\x00\x1e\x00\x1e\x00\xbe\x07\x00\x00.\xcf\x05\x00\x00\x00'
I am expecting to download the Word document from that URL within the same session.
Working solution
import shutil

import requests

response = requests.get('<url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)

data = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    # Decode gzip/deflate transfer encoding before copying the raw stream
    data.raw.decode_content = True
    shutil.copyfileobj(data.raw, f)
I am able to download the file as it is.
When you want to download a file directly you can use shutil.copyfileobj():
https://docs.python.org/2/library/shutil.html#shutil.copyfileobj
You are already passing stream=True to requests, which is what you need to get a file-like object back. Just pass response.raw as the source to copyfileobj().
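As a self-contained illustration of that pattern (the URL and filename here are placeholders, not from the question):
import shutil

import requests

# stream=True defers reading the body, so response.raw acts as a file-like object
with requests.get('https://example.com/file.bin', stream=True) as response:
    response.raise_for_status()
    response.raw.decode_content = True  # transparently undo gzip/deflate encoding
    with open('file.bin', 'wb') as f:
        shutil.copyfileobj(response.raw, f)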

Downloading file from imgur using python directly via url

Sometimes, links to imgur are not given with the file extension. For example: http://imgur.com/rqCqA. I want to download the file and give it a known name, or get its name inside a larger program. The problem is that I don't know the file type, so I don't know what extension to give it.
How can I achieve this in python or bash?
You should use the Imgur JSON API. Here's an example in Python, using requests:
import posixpath
import urllib.parse

import requests

url = "http://api.imgur.com/2/image/rqCqA.json"
r = requests.get(url)
# The API response includes direct links to the original image
img_url = r.json()["image"]["links"]["original"]

# Derive a local filename (with extension) from the image URL's path
fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)
r = requests.get(img_url)
with open(fn, "wb") as f:
    f.write(r.content)
I just tried going to the following URLs:
http://imgur.com/rqCqA.jpg
http://imgur.com/rqCqA.png
http://imgur.com/rqCqA.gif
And they all worked. It seems that Imgur stores several types of the same image - you can take your pick.
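If you want to automate that pick, a small sketch (the extension order and output name are arbitrary) could probe each extension and keep the first response that is actually an image:
import requests

base = "http://imgur.com/rqCqA"

# Try the known extensions and keep the first one that returns an image
for ext in (".jpg", ".png", ".gif"):
    r = requests.get(base + ext)
    if r.ok and r.headers.get("Content-Type", "").startswith("image/"):
        with open("rqCqA" + ext, "wb") as f:
            f.write(r.content)
        break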
I've used this before to download tons of xkcd webcomics and it seems to work for this as well.
import urllib.request

def saveImage(url, fpath):
    # Fetch the image and write it in binary mode ('w' would corrupt the bytes)
    with urllib.request.urlopen(url) as contents, open(fpath, 'wb') as f:
        f.write(contents.read())
Hope this helps
You can parse the source of the page using BeautifulSoup or similar and look for img tags with the photo hash in the src. With your example, the pic is
<img alt="" src="http://i.imgur.com/rqCqA.jpg" original-title="">
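A minimal sketch of that approach, assuming BeautifulSoup is installed (pip install beautifulsoup4):
import posixpath
import urllib.parse

import requests
from bs4 import BeautifulSoup

page_url = "http://imgur.com/rqCqA"
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

# The photo hash is the last path segment of the page URL
photo_hash = posixpath.basename(urllib.parse.urlsplit(page_url).path)

# Look for an img tag whose src contains the photo hash
for img in soup.find_all("img"):
    src = img.get("src", "")
    if photo_hash in src:
        img_url = urllib.parse.urljoin(page_url, src)
        filename = posixpath.basename(urllib.parse.urlsplit(img_url).path)
        with open(filename, "wb") as f:
            f.write(requests.get(img_url).content)
        break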
