Downloading a file from Imgur directly via URL - python

Sometimes, links to Imgur are not given with the file extension. For example: http://imgur.com/rqCqA. I want to download the file and give it a known name, or obtain its name inside a larger program. The problem is that I don't know the file type, so I don't know what extension to give it.
How can I achieve this in python or bash?

You should use the Imgur JSON API. Here's an example in Python, using requests:
import posixpath
import urllib.parse

import requests

# Ask the Imgur API for the image's metadata
url = "http://api.imgur.com/2/image/rqCqA.json"
r = requests.get(url)
img_url = r.json()["image"]["links"]["original"]

# Derive a file name (with extension) from the direct image URL
fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)

# Download the image itself and save it under that name
r = requests.get(img_url)
with open(fn, "wb") as f:
    f.write(r.content)
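Note that the v2 endpoint used above has long been deprecated. With the current Imgur API (v3) you register an application and send a Client-ID header; here is a minimal sketch of the same lookup, where YOUR_CLIENT_ID is a placeholder for a real registered client ID:
import posixpath
import urllib.parse

import requests

# v3 image endpoint; requires a registered application's Client-ID
api_url = "https://api.imgur.com/3/image/rqCqA"
headers = {"Authorization": "Client-ID YOUR_CLIENT_ID"}  # placeholder value

r = requests.get(api_url, headers=headers)
r.raise_for_status()
img_url = r.json()["data"]["link"]  # direct link to the image file

fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)
with open(fn, "wb") as f:
    f.write(requests.get(img_url).content)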

I just tried going to the following URLs:
http://imgur.com/rqCqA.jpg
http://imgur.com/rqCqA.png
http://imgur.com/rqCqA.gif
And they all worked. It seems that Imgur serves the same image in several formats - you can take your pick.
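Building on that, here is a small sketch that simply appends a known extension and fetches from the i.imgur.com image host (this relies on Imgur's observed URL scheme, not a documented guarantee):
import requests

def imgur_download(image_id, ext="jpg"):
    # Observed behaviour: the raw file is served at i.imgur.com/<id>.<ext>
    img_url = "http://i.imgur.com/{}.{}".format(image_id, ext)
    r = requests.get(img_url)
    r.raise_for_status()
    fn = "{}.{}".format(image_id, ext)
    with open(fn, "wb") as f:
        f.write(r.content)
    return fn

imgur_download("rqCqA")  # saves rqCqA.jpg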

I've used this before to download tons of xkcd webcomics and it seems to work for this as well.
import urllib2  # Python 2; use urllib.request in Python 3

def saveImage(url, fpath):
    # Fetch the URL and write the raw bytes to fpath
    contents = urllib2.urlopen(url)
    f = open(fpath, 'wb')  # binary mode, since images are binary data
    f.write(contents.read())
    f.close()
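Note that the snippet above is Python 2 (urllib2). A rough Python 3 equivalent using urllib.request, as a sketch:
from urllib.request import urlopen

def save_image(url, fpath):
    # Read the response bytes and write them out in binary mode
    with urlopen(url) as resp, open(fpath, 'wb') as f:
        f.write(resp.read())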
Hope this helps

You can parse the page source using BeautifulSoup or similar and look for img tags whose src contains the photo hash. With your example, the image tag is
<img alt="" src="http://i.imgur.com/rqCqA.jpg" original-title="">
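A minimal sketch of that approach (it assumes the i.imgur.com img tag shown above is present in the page; Imgur's markup may change):
import posixpath
import urllib.parse

import requests
from bs4 import BeautifulSoup

page = requests.get("http://imgur.com/rqCqA")
soup = BeautifulSoup(page.text, "html.parser")

# Find the first img tag whose src contains the photo hash
img = soup.find("img", src=lambda s: s and "rqCqA" in s)
img_url = img["src"]
if img_url.startswith("//"):  # handle protocol-relative URLs
    img_url = "http:" + img_url

fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)
with open(fn, "wb") as f:
    f.write(requests.get(img_url).content)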

Related

Get attached PDF file from HTTP request

I would like to download a file like this: https://www.bbs.unibo.it/conferma/?var=FormScaricaBrochure&brochureid=61305 with Python.
The problem is that it is not a direct link to the file; I only get the file id in the query string.
I tried this code:
import requests

remote_url = "https://www.bbs.unibo.it/conferma/"
r = requests.get(remote_url, params={"var": "FormScaricaBrochure", "brochureid": 61305})
But only the HTML is returned. How can I get the attached PDF?
You can use this example of how to download the file using only the brochureid:
import requests

# Direct download endpoint taken from the site's page source
url = "https://www.bbs.unibo.it/wp-content/themes/bbs/brochure-download.php?post_id={brochureid}&presentazione=true"
brochureid = 61305

with open("file.pdf", "wb") as f_out:
    f_out.write(requests.get(url.format(brochureid=brochureid)).content)
This downloads the PDF to file.pdf.

How do I download an image from a URL into a specific folder

I've been trying to write a function that receives a list of URLs and downloads each image to a given folder. I understand that I am supposed to use the urllib library, but I am not sure how.
The function should start like this:
def download_images(img_urls, dest_dir):
I don't even know how to start; I could only find information online on how to download an image, but not into a specific folder. If anyone can help me understand how to do the above, it would be wonderful.
Thank you in advance :)
Try this:
import urllib.request

# Downloads the image at the URL straight to the given destination path
urllib.request.urlretrieve('http://image-url', '/dest/path/file_name.jpg')
You can use the requests library, for example:
import requests

image_url = 'https://jessehouwing.net/content/images/size/w2000/2018/07/stackoverflow-1.png'
try:
    response = requests.get(image_url)
except requests.RequestException:
    print('Error')
else:
    if response.status_code == 200:
        with open('stackoverflow-1.png', 'wb') as f:
            f.write(response.content)
Here is a simple solution for your problem: it uses urllib.request.urlretrieve to download each image in your URL list img_urls, and os.path.basename to extract the file name from the URL, so the image can be saved under its original name in dest_dir.
from urllib.request import urlretrieve
import os

def download_images(img_urls, dest_dir):
    for url in img_urls:
        # os.path.join supplies the separator between folder and file name
        urlretrieve(url, os.path.join(dest_dir, os.path.basename(url)))
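A quick usage sketch (the URL list and folder name are placeholders; os.makedirs creates the destination folder if it does not exist yet):
import os

img_urls = ['https://i.imgur.com/rqCqA.jpg']  # placeholder list
dest_dir = 'images'

os.makedirs(dest_dir, exist_ok=True)  # create the folder if it is missing
download_images(img_urls, dest_dir)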

Is there a way to use python-requests to access web pages which are in fact PDFs?

I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but it seems the output that comes back is not properly decoded:
import requests

link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
r.text
The output looks like this:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the HTML. I also tried BeautifulSoup, but it does not decode it either. I hope someone can help. Thank you.
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically, but you can (for example) save the data to a file:
import requests

link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)
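There is no HTML inside a PDF to extract, but if what you actually want is the text, a dedicated library can pull it out of the saved file. A sketch using the third-party pypdf package (my suggestion, not something the original answer uses):
from pypdf import PdfReader  # pip install pypdf

reader = PdfReader('pdf.pdf')
for page in reader.pages:
    # extract_text() returns the text content of a single page
    print(page.extract_text())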

Python webscraping: Image incomplete when using urllib

I am trying to retrieve an image using Python and BeautifulSoup. I managed to get the full URL of the image, but when I use urllib.urlretrieve(imagelink, filename), it retrieves the image but the image is incomplete, only 3.2 KB.
The real images (I'm getting a lot of them) average around 800 KB. The loop iterates through and downloads all the images, but none of them are viewable and they are all the same file size. The full image URLs work fine when opened in a browser, though.
Any idea what could cause such an issue? I don't think showing all my code would help, but here is the section where I download the image:
print imagelink
filename = imagelink.split('/')[-1]
time.sleep(5)
urllib.urlretrieve(imagelink, filename)
time.sleep(5)
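A tiny file of constant size is often not an image at all but an HTML error page the server sent back, commonly because it rejects clients without a browser-like User-Agent. A quick diagnostic sketch (Python 2 to match the question's code; the header value is just an example):
import urllib2

req = urllib2.Request(imagelink, headers={'User-Agent': 'Mozilla/5.0'})
data = urllib2.urlopen(req).read()
print data[:100]  # if this starts with '<html', you fetched an error page

with open(filename, 'wb') as f:
    f.write(data)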
@宏杰李: requests is a wrapper around urllib3, which in turn is ultimately a wrapper around sockets :)
With urllib2, the same result can be achieved like this:
>>> import urllib2
>>> r = urllib2.urlopen('https://i.stack.imgur.com/tkGEv.jpg?s=328&g=1')
>>> with open("/home/ziya/Pictures/so_image.jpg", "wb") as img:
... img.write(r.read())
You should try requests:
import requests

url = 'https://i.stack.imgur.com/tkGEv.jpg?s=328&g=1'
# stream=True makes iter_content read the body in chunks instead of all at once
r = requests.get(url, stream=True)
with open('tkGEv.jpg', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

How to Download PDFs from Scraped Links [Python]?

I'm working on making a PDF Web Scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a url, and then get the PDFs and save them in a directory in my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil

bs = BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
print("")

suffix = ".pdf"
link_list = []

def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)
    #for link in soup.find_all('a'):  # Finds all links
    #    if suffix in str(link):  # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)
    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()
Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.
Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.
If it's of any use, I'm using Python 3.4.2
If this is something that does not require being logged in, you can use urlretrieve():
from urllib.request import urlretrieve

for link in link_list:
    # Give each download an explicit file name; without a second argument,
    # urlretrieve saves to a temporary file with a generated name
    urlretrieve(link, link.split('/')[-1])
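Putting the question's commented-out link scraping together with the download step, here is a rough end-to-end sketch (the html.parser choice and the relative-URL handling are my assumptions; adjust for the actual course page):
import posixpath
import urllib.parse
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for a in soup.find_all('a', href=True):
    href = a['href']
    if href.endswith('.pdf'):
        # Resolve relative links against the page URL
        pdf_url = urllib.parse.urljoin(url, href)
        fn = posixpath.basename(urllib.parse.urlsplit(pdf_url).path)
        urlretrieve(pdf_url, fn)
        print("Saved", fn)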
