I am trying to retrieve an image using Python and BeautifulSoup. I managed to get the full URL of the image, but when I use urllib.urlretrieve(imagelink, filename), the image it retrieves is incomplete, only 3.2 KB.
The real images (I'm downloading a lot of them) average around 800 KB. The script iterates through and downloads all the images, but none of them are viewable and they all have the same file size. The full image URLs work fine when opened in the browser, though.
Any idea what could cause such an issue? I don't think showing my code would help but here is the section where I am getting the url:
print imagelink
filename = imagelink.split('/')[-1]
time.sleep(5)
urllib.urlretrieve(imagelink, filename)
time.sleep(5)
@宏杰李: requests is just another wrapper around lower-level HTTP machinery, much as urllib is ultimately a wrapper around sockets :)
With urllib2 the same result can be achieved like this.
>>> import urllib2
>>> r = urllib2.urlopen('https://i.stack.imgur.com/tkGEv.jpg?s=328&g=1')
>>> with open("/home/ziya/Pictures/so_image.jpg", "wb") as img:
...     img.write(r.read())
You should try requests:
import requests
url = 'https://i.stack.imgur.com/tkGEv.jpg?s=328&g=1'
r = requests.get(url)
with open('tkGEv.jpg', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)
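Coming back to the original problem: the fact that every download is the same 3.2 KB strongly suggests the server is returning something other than the image, for example an HTML error page. A quick diagnostic sketch with requests (the URL is just a placeholder standing in for your imagelink):
import requests

# placeholder standing in for the `imagelink` built by the scraper
imagelink = 'https://example.com/images/photo.jpg'

r = requests.get(imagelink)
print(r.status_code)                    # anything other than 200 means the image was not served
print(r.headers.get('Content-Type'))    # should be image/jpeg (or similar), not text/html
print(len(r.content))                   # compare with the expected ~800 KB

if r.ok and r.headers.get('Content-Type', '').startswith('image/'):
    with open(imagelink.split('/')[-1], 'wb') as f:
        f.write(r.content)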
I'm trying to download this video: https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
I tried the following but it doesn't work.
link = "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4"
urllib.request.urlretrieve(link, 'video.mp4')
I'm getting:
urllib.error.HTTPError: HTTP Error 403: Forbidden
Is there another way to download an mp4 file without using urllib?
I have no problem downloading it with the requests module:
import requests
url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
response = requests.get(url)
with open('video.mp4', 'wb') as f: # use `"b"` to open in `bytes mode`
    f.write(response.content) # use `.content` to get `bytes`
It's a small file (~10 MB), but for bigger files you may want to download in chunks.
import requests
url = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'
response = requests.get(url, stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in response.iter_content(10000):  # 10_000 bytes
        if chunk:
            #print('.', end='') # every dot will mean 10_000 bytes
            f.write(chunk)
The documentation shows Streaming Requests, but for text data.
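The 403 itself usually means the server rejects the default urllib User-Agent. If you would rather keep urllib.request instead of switching to requests, a sketch that sends a browser-like User-Agent header sometimes gets past it (the header value is just an example):
import urllib.request

link = 'https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4'

# pretend to be a browser; many servers block the default 'Python-urllib/x.y' agent
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as response, open('video.mp4', 'wb') as f:
    f.write(response.read())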
As for the file name: url is a string, so you can use string functions to get the element after the last /:
filename = url.split('/')[-1]
Or you can try to use os.path.
At least it works on Linux, maybe because Linux also uses / in local paths.
import os
head, tail = os.path.split(url)
# head: 'https://www.learningcontainer.com/wp-content/uploads/2020/05'
# tail: 'sample-mp4-file.mp4'
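Both ways keep any query string, so for a URL like video.mp4?size=small you would get video.mp4?size=small as the name. If your URLs can carry parameters, a small sketch that strips them first with urllib.parse (the example URL is made up):
from urllib.parse import urlsplit
import os

url = 'https://example.com/media/video.mp4?size=small'  # hypothetical URL with a query string

path = urlsplit(url).path           # '/media/video.mp4' - the query string is dropped
filename = os.path.basename(path)   # 'video.mp4'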
I've been trying to write a function that receives a list of URLs and downloads each image from each URL to a given folder. I understand that I am supposed to be using the urllib library, but I am not sure how.
the function should start like this :
def download_images(img_urls, dest_dir):
I don't even know how to start; I could only find information online on how to download an image, but not into a specific folder. If anyone can help me understand how to do the above, it would be wonderful.
thank you in advance:)
Try this:
import urllib.request
urllib.request.urlretrieve('http://image-url', '/dest/path/file_name.jpg')
You can use the requests library, for example:
import requests
image_url = 'https://jessehouwing.net/content/images/size/w2000/2018/07/stackoverflow-1.png'
try:
    response = requests.get(image_url)
except:
    print('Error')
else:
    if response.status_code == 200:
        with open('stackoverflow-1.png', 'wb') as f:
            f.write(response.content)
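To put that into the shape the question asks for, here is a sketch of download_images built on requests; it assumes dest_dir already exists and that each URL ends in a usable file name:
import os
import requests

def download_images(img_urls, dest_dir):
    for url in img_urls:
        response = requests.get(url)
        if response.status_code == 200:
            # save under the file name taken from the end of the URL
            filename = os.path.join(dest_dir, os.path.basename(url))
            with open(filename, 'wb') as f:
                f.write(response.content)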
Here is a simple solution for your problem using urllib.request.urlretrieve to download each image from your URL list img_urls, and os.path.basename to get the file name from the URL so you can save it under its original name in dest_dir:
from urllib.request import urlretrieve
import os
def download_images(img_urls, dest_dir):
    for url in img_urls:
        # os.path.join avoids problems when dest_dir has no trailing slash
        urlretrieve(url, os.path.join(dest_dir, os.path.basename(url)))
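For example (the URLs are placeholders, and the destination folder has to exist, so create it first with os.makedirs if needed):
import os

os.makedirs('downloads', exist_ok=True)  # make sure the destination folder exists
download_images(['https://example.com/cat.jpg', 'https://example.com/dog.png'], 'downloads')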
I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but the output that comes back does not seem to be properly decoded:
link= 'http://www.pdf995.com/samples/pdf.pdf'
import requests
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the HTML. I also tried with BeautifulSoup, but it does not decode it either. I hope someone can help. Thank you, BR
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically, but you might (for example) save the data to a file:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)
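If what you actually need is the text inside the PDF rather than the raw bytes, you will need a PDF library on top of requests; the original answer does not name one, but as an example pypdf can do it (a minimal sketch, assuming pip install pypdf):
from pypdf import PdfReader  # third-party library, assumed installed

reader = PdfReader('pdf.pdf')  # the file saved above
text = ''.join(page.extract_text() for page in reader.pages)
print(text[:500])  # first 500 characters of the extracted text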
I'm working on a program that downloads data from a series of URLs, like this:
https://server/api/getsensordetails.xmlid=sesnsorID&username=user&password=password
The program needs to go through a list of IDs (about 2500) and request each URL. I tried to do it using the following code:
import webbrowser
webbrowser.open(url)
but this code opens the URL in the browser and asks me to confirm the download. I need it to simply download the files without opening a browser, and certainly without having to confirm each one.
thanks for everything
You can use the Requests library.
import requests
print('Beginning file download with requests')
url = 'http://PathToFile.jpg'
r = requests.get(url)
with open('pathOfFileToReceiveDownload.jpg', 'wb') as f:
    f.write(r.content)
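For the actual use case in the question (around 2500 sensor IDs), the same pattern just goes inside a loop. A sketch with the ID list and URL template as placeholders adapted from the question (I've assumed a '?' belongs before the parameters):
import requests

sensor_ids = ['1001', '1002', '1003']  # placeholder for the real list of ~2500 IDs

for sensor_id in sensor_ids:
    # URL template adapted from the question
    url = ('https://server/api/getsensordetails.xml'
           f'?id={sensor_id}&username=user&password=password')
    r = requests.get(url)
    if r.status_code == 200:
        with open(f'sensor_{sensor_id}.xml', 'wb') as f:
            f.write(r.content)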
Sometimes, links to imgur are given without the file extension. For example: http://imgur.com/rqCqA. I want to download the file and give it a known name, or get its name inside a larger piece of code. The problem is that I don't know the file type, so I don't know what extension to give it.
How can I achieve this in python or bash?
You should use the Imgur JSON API. Here's an example in Python, using requests:
import posixpath
import urllib.parse
import requests
url = "http://api.imgur.com/2/image/rqCqA.json"
r = requests.get(url)
img_url = r.json()["image"]["links"]["original"]
fn = posixpath.basename(urllib.parse.urlsplit(img_url).path)
r = requests.get(img_url)
with open(fn, "wb") as f:
    f.write(r.content)
I just tried going to the following URLs:
http://imgur.com/rqCqA.jpg
http://imgur.com/rqCqA.png
http://imgur.com/rqCqA.gif
And they all worked. It seems that Imgur stores several types of the same image - you can take your pick.
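If you would rather not guess, another option (not from the original answers) is to ask the server: the Content-Type header says what kind of image came back, and the standard mimetypes module can map that to an extension. A sketch with requests:
import mimetypes
import requests

r = requests.get('http://i.imgur.com/rqCqA.jpg')  # direct image URL from this thread
content_type = r.headers.get('Content-Type', '')  # e.g. 'image/jpeg'
# guess_extension may give '.jpg' or '.jpe' depending on the Python version; fall back if unknown
ext = mimetypes.guess_extension(content_type) or '.jpg'

with open('rqCqA' + ext, 'wb') as f:
    f.write(r.content)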
I've used this before to download tons of xkcd webcomics and it seems to work for this as well.
import urllib2

def saveImage(url, fpath):
    contents = urllib2.urlopen(url)
    f = open(fpath, 'wb')  # 'wb': images are binary data, so write in binary mode
    f.write(contents.read())
    f.close()
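For example, with the direct image URL from this thread:
saveImage('http://i.imgur.com/rqCqA.jpg', 'rqCqA.jpg')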
Hope this helps
You can parse the source of the page using BeautifulSoup or similar and look for img tags with the photo hash in the src. With your example, the image tag is:
<img alt="" src="http://i.imgur.com/rqCqA.jpg" original-title="">
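A minimal sketch of that approach with requests and BeautifulSoup; the way the tag is located is an assumption about imgur's markup, which may have changed since:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://imgur.com/rqCqA')
soup = BeautifulSoup(page.text, 'html.parser')

# look for an <img> whose src contains the photo hash from the page URL
img = soup.find('img', src=lambda s: s and 'rqCqA' in s)
if img:
    img_url = img['src']
    if img_url.startswith('//'):  # imgur sometimes serves protocol-relative URLs
        img_url = 'http:' + img_url
    print(img_url)  # e.g. http://i.imgur.com/rqCqA.jpg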