I download a bunch of pdf files and archive them.
Most of the documents work fine but I have a problem with one.
The link to the document which doesn't work is:
https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf
When I download it normally, it works fine.
I tried two different approaches with Python to download it.
import requests

response = requests.get(
    'https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf',
    stream=True)
with open('test.pdf', 'wb') as f:  # the with block closes the file automatically
    for chunk in response.iter_content(2000):
        f.write(chunk)
Second approach:
import requests

def pdfDownload(url):
    response = requests.get(url)
    with open('test.pdf', 'wb') as egpdf:  # with closes the file for us
        egpdf.write(response.content)
In both cases I get an error message when I try to open it afterwards.
You need to replace your URL with this one:
https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf?switchLocale=y&siteEntryPassthrough=true
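For completeness, here is the first approach again with the adjusted URL. The extra query parameters appear to let the request skip the site's locale/entry interstitial, which otherwise serves an HTML page instead of the PDF; that interpretation is my assumption.

import requests

url = ('https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/'
       'susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf'
       '?switchLocale=y&siteEntryPassthrough=true')

response = requests.get(url, stream=True)
response.raise_for_status()  # fail loudly instead of silently saving an error page
with open('test.pdf', 'wb') as f:
    for chunk in response.iter_content(2000):
        f.write(chunk)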
I'm trying to download this video: https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
I tried the following but it doesn't work.
link = "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4"
urllib.request.urlretrieve(link, 'video.mp4')
I'm getting:
urllib.error.HTTPError: HTTP Error 403: Forbidden
Is there another way to download an mp4 file without using urllib?
I have no problem downloading it with the requests module:
import requests

url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
response = requests.get(url)
with open('video.mp4', 'wb') as f:  # 'wb' opens the file in bytes mode
    f.write(response.content)       # .content gives the body as bytes
It was a small file (~10 MB), but for bigger files you may want to download in chunks.
import requests

url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
response = requests.get(url, stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in response.iter_content(10_000):  # 10_000 bytes per chunk
        if chunk:
            # print('.', end='')  # every dot would mean 10_000 bytes
            f.write(chunk)
The documentation shows Streaming Requests, but for text data.
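As an aside, the 403 from urllib most likely comes from the server rejecting urllib's default User-Agent (Python-urllib/3.x), while requests sends its own python-requests/x.y header, which this server apparently accepts; that explanation is an educated guess. If you want to stay with urllib, here is a minimal sketch that sends a browser-like User-Agent (the header value is just an example):

import urllib.request

url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # pretend to be a browser
with urllib.request.urlopen(req) as response, open('video.mp4', 'wb') as f:
    f.write(response.read())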
url is a string, so you can use string functions to get the element after the last /:
filename = url.split('/')[-1]
Or you can try to use os.path
At least it works on Linux, maybe because Linux also uses / in local paths.
import os
head, tail = os.path.split(url)
# head: 'https://www.learningcontainer.com/wp-content/uploads/2020/05'
# tail: 'sample-mp4-file.mp4'
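Both tricks break if the URL carries a query string (for example ...sample-mp4-file.mp4?token=abc would give a filename ending in ?token=abc). Here is a sketch that strips the query part first with urllib.parse:

import posixpath
import urllib.parse

url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
path = urllib.parse.urlsplit(url).path  # '/wp-content/uploads/2020/05/sample-mp4-file.mp4'
filename = posixpath.basename(path)     # 'sample-mp4-file.mp4'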
I'm trying to automate the download of docs via Selenium.
I'm using requests.get() to download the file after extracting the URL from the website:
import time
import requests

url = 'https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
myfile = requests.get(url)
with open('/Users/hemanthj/Downloads/AB Test/' + 'A-Acc-USD' + '.pdf', 'wb') as f:
    f.write(myfile.content)
time.sleep(3)
The file is downloaded but is corrupted when I try to open it. The file size is only a few KB at most.
I tried adding the header info from this thread too, but no luck:
Corrupted PDF file after requests.get() with Python
What within the headers makes the download work? Any solutions?
The problem was an incorrect URL.
It loaded HTML instead of a PDF.
Looking through the site, I found the URL you were looking for.
Try this code and then open the document with a PDF reader program.
import requests
import pathlib

def load_pdf_from(url: str, filename: pathlib.Path) -> None:
    response: requests.Response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=1024):
                pdf_file.write(chunk)
    else:
        print(f"Failed to load pdf: {url}")

url: str = 'https://www.schroders.com/hkrprewrite/retail/en/attachment2.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
target_filename: pathlib.Path = pathlib.Path.cwd().joinpath('loaded_pdf.pdf')
load_pdf_from(url, target_filename)
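If you hit this again, one way to catch "HTML instead of PDF" before writing the file is to look at the Content-Type response header. A minimal sketch; servers are not obliged to set this header truthfully, so treat it as a heuristic:

import requests

def looks_like_pdf(response: requests.Response) -> bool:
    # 'application/pdf' means a PDF; 'text/html' here usually
    # means an error or interstitial page came back instead.
    content_type = response.headers.get('Content-Type', '')
    return 'pdf' in content_type.lower()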
I want to download about 1000 pdf files from a web page.
Then I encountered this awkward PDF URL format.
Neither requests.get() nor urllib.request.urlretrieve() works for me.
A usual PDF URL looks like:
https://webpage.com/this_file.pdf
But this URL is like:
https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01
So it doesn't have .pdf in the URL. If you click on it, you can download it, but using Python's urllib you get a corrupt file.
At first I thought it was redirected to some other URL,
so I used requests.get(url, allow_redirects=True),
but the result was the same URL as before.
import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'
urllib.request.urlretrieve(url, filename)

This code downloads a corrupt PDF file.
I solved it by using the content field of the response object:
import requests

filename = './novels/pdf1.pdf'
url = ...  # same download URL as above
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)
Referred to this Q&A: Download and save PDF file with Python requests module.
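One extra sanity check that helps with extension-less URLs like this: a real PDF starts with the magic bytes %PDF-, so you can verify a download before archiving it. A minimal sketch (it only checks the header, not that the whole file is intact):

def is_probably_pdf(path):
    # A valid PDF file begins with the magic bytes b'%PDF-'
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'

print(is_probably_pdf('./novels/pdf1.pdf'))  # True for the fixed download above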
I'm trying to take some links from a text file and download them onto my computer. However, I would like the downloaded pages to be exactly the same as they are in the browser. The wiki pages I downloaded are not: they don't display some of the pictures, and they're mostly just text when I open them.
How can I achieve what I want? I saw some things with Scrapy and Beautiful Soup, but I'm not experienced with them.
My code:
import urllib.request

with open('wiki_linkovi', 'r') as fr, open('imena_elemenata.txt', 'w') as fw1:
    link = fr.readlines()
    j = 0
    for i in link:
        base = 'https://en.wikipedia.org/wiki/'
        start = i.find(base) + len(base)
        end = i.find('\n', start)
        ime = i[start:end]
        fw1.write(ime + '\n')
        response = urllib.request.urlopen(i.strip())  # save starts here; strip the trailing newline
        webContent = response.read()
        with open(ime + '.html', 'wb') as f:
            f.write(webContent)
        j = j + 1
        print(str(j) + '. link\n')
So, in short: I'd like to download the webpage completely.
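The saved HTML looks bare because the img tags still point at URLs that a browser would fetch separately. Here is a minimal sketch of one approach, assuming BeautifulSoup (bs4) is installed: download each image next to the HTML file and rewrite the src attributes to the local copies. It handles only img tags, not CSS or scripts:

import os
import urllib.error
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

def save_page_with_images(url, html_path, asset_dir='assets'):
    os.makedirs(asset_dir, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        soup = BeautifulSoup(response.read(), 'html.parser')
    for img in soup.find_all('img', src=True):
        img_url = urllib.parse.urljoin(url, img['src'])  # resolves relative and protocol-relative srcs
        name = os.path.basename(urllib.parse.urlsplit(img_url).path)
        local_name = os.path.join(asset_dir, name)
        try:
            urllib.request.urlretrieve(img_url, local_name)
            img['src'] = local_name  # point the tag at the local copy
        except urllib.error.URLError:
            pass  # keep the original src if the image can't be fetched
    with open(html_path, 'w', encoding='utf-8') as f:
        f.write(str(soup))

save_page_with_images('https://en.wikipedia.org/wiki/Hydrogen', 'Hydrogen.html')  # example page, arbitrary choice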
Sometimes, links to imgur are not given with the file extension. For example: http://imgur.com/rqCqA. I want to download the file and give it a known name, or get its name inside a larger program. The problem is that I don't know the file type, so I don't know what extension to give it.
How can I achieve this in Python or bash?
You should use the Imgur JSON API. Here's an example in Python, using requests:
import posixpath
import urllib.parse

import requests

url = "http://api.imgur.com/2/image/rqCqA.json"
r = requests.get(url)
img_url = r.json()["image"]["links"]["original"]  # .json() parses the response body
fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)

r = requests.get(img_url)
with open(fn, "wb") as f:
    f.write(r.content)
I just tried going to the following URLs:
http://imgur.com/rqCqA.jpg
http://imgur.com/rqCqA.png
http://imgur.com/rqCqA.gif
And they all worked. It seems that Imgur stores several types of the same image - you can take your pick.
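If you'd rather not guess, you can ask the server for the type: the Content-Type header of the direct-image URL (i.imgur.com) tells you the format. A minimal sketch; the extension mapping is my own and covers only the common types:

import requests

EXTENSIONS = {
    'image/jpeg': '.jpg',
    'image/png': '.png',
    'image/gif': '.gif',
}

r = requests.get('http://i.imgur.com/rqCqA.jpg')
content_type = r.headers.get('Content-Type', '').split(';')[0].strip()
ext = EXTENSIONS.get(content_type)
if ext:
    with open('rqCqA' + ext, 'wb') as f:
        f.write(r.content)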
I've used this before to download tons of xkcd webcomics and it seems to work for this as well.
import urllib.request

def saveImage(url, fpath):
    contents = urllib.request.urlopen(url)
    with open(fpath, 'wb') as f:  # 'wb': image data is binary, not text
        f.write(contents.read())
Hope this helps
You can parse the source of the page using BeautifulSoup or similar and look for img tags with the photo hash in the src. With your example, the pic is
<img alt="" src="http://i.imgur.com/rqCqA.jpg" original-title="">