I have just written a small function to download and save some images to my hard disk. Some URLs redirect and/or have bad file extensions, so I added some validation. However, the checks make the script stop as soon as it hits a bad URL. I would like to modify the script so that the loop discards bad URLs and keeps going, and breaks only once an image has been downloaded successfully (I need just one successful download). Can you please take a look at my code and share some tips? Thank you
from pattern.web import URL, DOM, plaintext, extension
import requests, re, os, sys, datetime, time, re, random
def download_single_image(query, folder, image_options=None):
    download_fault = 0
    url_link = None
    valid_image_ext_list = ['.png', '.jpg', '.gif', '.bmp', '.tiff', 'jpeg'] # not comprehensive
    pic_links = scrape_links(query, image_options) # pic_links contains an array of urls
    for url in pic_links:
        url = URL(url)
        print "checking re-direction"
        if url.redirect:
            print "redirected, returning"
            return # if there is a redirect, return
        file_ext = extension(url.page)
        print "checking file extension", file_ext
        if file_ext.lower() not in valid_image_ext_list:
            print "not a valid extension, returning"
            return # return if not valid image extension found
        # Download the image.
        print('Downloading image %s... ' % (pic))
        res = requests.get(pic)
        try:
            res.raise_for_status()
        except Exception as exc:
            print('There was a problem: %s' % (exc))
        print ('Saving image to %s...'% (folder))
        if not os.path.exists(folder + '/' + os.path.basename(pic)):
            imageFile = open(os.path.join(folder, os.path.basename(pic)), mode='wb')
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()
            print('pic saved %s' % os.path.basename(pic))
        else:
            print('File already exists!')
        return os.path.basename(pic)
Change this:
return # return if not valid image extension found
to this:
continue # skip this URL if no valid image extension is found
The first aborts the whole function; the second skips to the next URL in the loop. The same applies to the return in the redirect check.
PS. File extensions mean very little on the Internet. I would rather send a HEAD request (with curl, for example) and check the Content-Type the server returns to decide whether it is an image.
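As a rough illustration of both points (skipping bad URLs with continue, and judging by Content-Type instead of the file extension), here is a minimal Python 3 sketch. It uses the standard library's urllib.request rather than curl, and the function names is_image_content_type, head_content_type and first_image_url are made up for this example; they are not part of the asker's code.

```python
import urllib.request


def is_image_content_type(content_type):
    """Return True if a Content-Type header value names an image type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def head_content_type(url):
    """Send a HEAD request and return the Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.headers.get("Content-Type")


def first_image_url(urls, get_content_type=head_content_type):
    """Return the first URL the server reports as an image, else None."""
    for url in urls:
        try:
            content_type = get_content_type(url)
        except (OSError, ValueError) as exc:
            # Network/HTTP errors are OSError subclasses; malformed URLs
            # raise ValueError. Either way, try the next candidate.
            print("skipping %s: %s" % (url, exc))
            continue
        if not is_image_content_type(content_type):
            continue  # not an image: keep looking
        return url  # first good candidate ends the loop
    return None
```

The get_content_type parameter exists only so the loop can be exercised without network access; in normal use it can be left at its default.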
I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file, and then I download each of them to a folder with requests. At first, I used Google as the scrape site, but due to several reasons, I have decided that Wikipedia is a much better alternative. However, after my first attempt, many of the images couldn't be opened, so I tried again, this time saving each image under a name whose ending matched the ending of its link. More images could be opened this way, but many still could not. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I used my function to download them afterwards, they kept downloading correctly (i.e. I could open them). I am not sure if it is important, but the image endings that I generally come across are svg.png and png. I want to know why this is occurring and what I can do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
    f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check whether the file contents start with the string "<svg" and, if so, create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path
# Make sure the output directory exists before writing into it.
Path('downloads').mkdir(exist_ok=True)
# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip() # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except requests.RequestException:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue # no known extension, so skip instead of writing a file
        # Reuse the bytes already downloaded rather than requesting the URL again.
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
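The idea of looking at the downloaded bytes rather than the URL can be taken a little further by checking the well-known signatures at the start of common image formats. The following is only a sketch: sniff_image_extension is a hypothetical helper name, it covers just four formats, and the SVG test is a heuristic (an HTML page that embeds an svg element near its top would also match).

```python
def sniff_image_extension(content):
    """Guess a file extension from the first bytes of a download, or None."""
    if content.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "png"
    if content.startswith(b"\xff\xd8\xff"):  # JPEG start-of-image marker
        return "jpg"
    if content.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # SVG is XML text, so the <svg tag may come after an XML declaration
    # or a comment; look for it in the first kilobyte only.
    if b"<svg" in content[:1024].lower():
        return "svg"
    return None
```

A download whose bytes match none of these (an HTML error page, for example) returns None and can be skipped instead of being saved under a misleading name.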
I created a document site on Plone to which files can be uploaded. I saw that Plone saves them on the filesystem as blobs. Now I need to retrieve them with a Python script that will run OCR on the downloaded PDFs. Does anyone have any idea how to do it? Thank you
Not sure how to extract PDFs from BLOB-storage or if it's possible at all, but you can extract them from a running Plone-site (e.g. executing the script via a browser-view):
import os
from Products.CMFCore.utils import getToolByName
def isPdf(search_result):
    """Check mime_type for Plone >= 5.1, otherwise check file-extension."""
    if mimeTypeIsPdf(search_result) or search_result.id.endswith('.pdf'):
        return True
    return False
def mimeTypeIsPdf(search_result):
    """
    Plone-5.1 introduced the mime_type-attribute on files.
    Try to get it, if it doesn't exist, fail silently.
    Return True if mime_type exists and is PDF, otherwise False.
    """
    try:
        mime_type = search_result.mime_type
        if mime_type == 'application/pdf':
            return True
    except Exception:
        pass
    return False
def exportPdfFiles(context, export_path):
    """
    Get all PDF-files of site and write them to export_path on the filesystem.
    Preserve the folder-structure of the site.
    """
    catalog = getToolByName(context, 'portal_catalog')
    search_results = catalog(portal_type='File', Language='all')
    for search_result in search_results:
        # For each PDF-file:
        if isPdf(search_result):
            file_path = export_path + search_result.getPath()
            file_content = search_result.getObject().data
            parent_path = '/'.join(file_path.split('/')[:-1])
            # Create missing directories on the fly:
            if not os.path.exists(parent_path):
                os.makedirs(parent_path)
            # Write PDF (binary mode, since a PDF is not text):
            with open(file_path, 'wb') as fil:
                fil.write(file_content)
            print('Wrote ' + file_path)
    print('Finished exporting PDF-files to ' + export_path)
The example keeps the folder-structure of the Plone-site in the export-directory. If you want them flat in one directory, a handler for duplicate file-names is needed.
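For the flat-directory case, one possible shape for such a duplicate-name handler is sketched below. The helper unique_flat_name and its numbering scheme (report.pdf, report-1.pdf, report-2.pdf, ...) are assumptions made for illustration, not something Plone provides.

```python
import os


def unique_flat_name(file_name, used_names):
    """Return file_name, or a numbered variant that is not in used_names yet.

    used_names is a set of names already written; the chosen name is added
    to it so that the next call sees it as taken.
    """
    base, ext = os.path.splitext(file_name)
    candidate = file_name
    counter = 1
    while candidate in used_names:
        candidate = "%s-%d%s" % (base, counter, ext)
        counter += 1
    used_names.add(candidate)
    return candidate
```

In an export loop, the result of unique_flat_name(search_result.id, used_names) would be joined to the export directory in place of the full site path, with used_names created once as an empty set before the loop.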
OK I'm trying to scrape jpg image from Gucci website. Take this one as example.
http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg
I tried urllib.urlretrieve, which doesn't work because Gucci blocks it. So I wanted to use requests to fetch the raw data of the image and then write it into a .jpg file.
image = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg").text.encode('utf-8')
I encoded it because if I don't, it keeps telling me that gbk cannot encode the string.
Then:
with open('1.jpg', 'wb') as f:
    f.write(image)
looks good right? But the result is -- the jpg file cannot be opened. There's no image! Windows tells me the jpg file is damaged.
What could be the problem?
I'm thinking that maybe when I scraped the image, I lost some information, or some characters are wrongly scraped. But how can I find out which?
I'm thinking that maybe some information is lost via encoding. But if I don't encode, I cannot even print it, not to mention writing it into a file.
What could go wrong?
I am not sure about the purpose of your use of encode. You're not working with text, you're working with an image. You need to access the response as binary data, not as text, and use image manipulation functions rather than text ones. Try this:
from PIL import Image
from io import BytesIO
import requests
response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
# Wrap the raw bytes in a file-like object (named so it does not shadow the built-in bytes).
image_data = BytesIO(response.content)
image = Image.open(image_data)
image.save("1.jpg")
Note the use of response.content instead of response.text. You will need to have PIL or Pillow installed to use the Image module. BytesIO is included in Python 3.
Or you can just save the data straight to disk without looking at what's inside:
import requests
response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
with open('1.jpg','wb') as f:
    f.write(response.content)
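To see why going through text damages the file, here is a small self-contained demonstration that needs no network access. It assumes the body ends up being decoded as UTF-8 with undecodable bytes replaced, which is roughly what happens when response.text is used on binary data; the exact encoding chosen in the asker's case is not known.

```python
# Typical first bytes of a JPEG (JFIF) file; they are not valid UTF-8.
jpeg_start = b"\xff\xd8\xff\xe0"

# Decoding as text replaces every undecodable byte with U+FFFD.
as_text = jpeg_start.decode("utf-8", errors="replace")

# Encoding the text again does not bring the original bytes back.
round_tripped = as_text.encode("utf-8")

print(round_tripped == jpeg_start)  # False
```

Once the original bytes have been replaced there is no way to recover them, which is why the saved file cannot be opened as an image.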
A JPEG file is not text, it's binary data. So you need to use the response's .content attribute to access it.
The code below also includes a get_headers() function, which can be handy when you're exploring a Web site.
import requests
def get_headers(url):
    resp = requests.head(url)
    print("Status: %d" % resp.status_code)
    resp.raise_for_status()
    for t in resp.headers.items():
        print('%-16s : %s' % t)
def download(url, fname):
    ''' Download url to fname '''
    print("Downloading '%s' to '%s'" % (url, fname))
    resp = requests.get(url)
    resp.raise_for_status()
    with open(fname, 'wb') as f:
        f.write(resp.content)
def main():
    site = 'http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/'
    basename = '277520_F4CYG_4080_001_web_full_new_theme.jpg'
    url = site + basename
    fname = 'qtest.jpg'
    try:
        #get_headers(url)
        download(url, fname)
    except requests.exceptions.HTTPError as e:
        print("%s '%s'" % (e, url))
if __name__ == '__main__':
    main()
We call the .raise_for_status() method so that get_headers() and download() raise an Exception if something goes wrong; we catch the Exception in main() and print the relevant info.
How can I use the Python standard library to get a file object, silently ensuring it's up-to-date from some other location?
A program I'm working on needs to access a set of files locally; they're
just normal files.
But those files are local cached copies of documents available at remote
URLs — each file has a canonical URL for that file's content.
(I write here about HTTP URLs, but I'm looking for a solution that isn't specific to any particular remote fetching protocol.)
I'd like an API for ‘get_file_from_cache’ that looks something like:
file_urls = {
"/path/to/foo.txt": "http://example.org/spam/",
"other/path/bar.data": "https://example.net/beans/flonk.xml",
}
for (filename, url) in file_urls.items():
infile = get_file_from_cache(filename, canonical=url)
do_stuff_with(infile.read())
If the local file's modification timestamp is not significantly
earlier than the Last-Modified timestamp for the document at the
corresponding URL, get_file_from_cache just returns the file object
without changing the file.
The local file might be out of date (its modification timestamp may be
significantly older than the Last-Modified timestamp from the
corresponding URL). In that case, get_file_from_cache should first
read the document's contents into the file, then return the file
object.
The local file may not yet exist. In that case, get_file_from_cache
should first read the document content from the corresponding URL,
create the local file, and then return the file object.
The remote URL may not be available for some reason. In that case,
get_file_from_cache should simply return the file object, or if that
can't be done, raise an error.
So this is something similar to an HTTP object cache. Except where those
are usually URL-focussed with the local files a hidden implementation
detail, I want an API that focusses on the local files, with the remote
requests a hidden implementation detail.
Does anything like this exist in the Python library, or as simple code
using it? With or without the specifics of HTTP and URLs, is there some
generic caching recipe already implemented with the standard library?
This local file cache (ignoring the specifics of URLs and network access)
seems like exactly the kind of thing that is easy to get wrong in
countless ways, and so should have a single obvious implementation
available.
Am I in luck? What do you advise?
From a quick Googling I couldn't find an existing library that can do this, although I'd be surprised if there weren't such a thing. :)
Anyway, here's one way to do it using the popular Requests module. It'd be pretty easy to adapt this code to use urllib / urllib2, though.
#! /usr/bin/env python
''' Download a file if it doesn't yet exist in offline cache, or if the online
version is more than age seconds newer than the cached version.
Example code for
http://stackoverflow.com/questions/26436641/access-a-local-file-but-ensure-it-is-up-to-date
Written by PM 2Ring 2014.10.18
'''
import sys
import os
import email.utils
import requests
cache_path = 'offline_cache'
#Translate local file names in cache_path to URLs
file_urls = {
    'example1.html': 'http://www.example.com/',
    'badfile': 'http://httpbin.org/status/404',
    'example2.html': 'http://www.example.org/index.html',
}
def get_headers(url):
    resp = requests.head(url)
    print("Status: %d" % resp.status_code)
    resp.raise_for_status()
    for k,v in resp.headers.items():
        print('%-16s : %s' % (k, v))
def get_url_mtime(url):
    ''' Get last modified time of an online file from the headers
    and convert to a timestamp
    '''
    resp = requests.head(url)
    resp.raise_for_status()
    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    return email.utils.mktime_tz(t)
def download(url, fname):
    ''' Download url to fname, setting mtime of file to match url '''
    sys.stderr.write("Downloading '%s' to '%s'\n" % (url, fname))
    resp = requests.get(url)
    #print("Status: %d" % resp.status_code)
    resp.raise_for_status()
    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    timestamp = email.utils.mktime_tz(t)
    #print('last-modified', timestamp)
    with open(fname, 'wb') as f:
        f.write(resp.content)
    os.utime(fname, (timestamp, timestamp))
def open_cached(basename, mode='r', age=0):
    ''' Open a cached file.
    Download it if it doesn't yet exist in cache, or if the online
    version is more than age seconds newer than the cached version.'''
    fname = os.path.join(cache_path, basename)
    url = file_urls[basename]
    #print(fname, url)
    if os.path.exists(fname):
        #Check if online version is sufficiently newer than offline version
        file_mtime = os.path.getmtime(fname)
        url_mtime = get_url_mtime(url)
        if url_mtime > age + file_mtime:
            download(url, fname)
    else:
        download(url, fname)
    return open(fname, mode)
def main():
    for fname in ('example1.html', 'badfile', 'example2.html'):
        print(fname)
        try:
            with open_cached(fname, 'r') as f:
                for i, line in enumerate(f, 1):
                    print("%3d: %s" % (i, line.rstrip()))
        except requests.exceptions.HTTPError as e:
            sys.stderr.write("%s '%s' = '%s'\n" % (e, file_urls[fname], fname))
        print('')
if __name__ == "__main__":
    main()
Of course, for real-world use you should add some proper error checking.
You may notice that I've defined a function get_headers(url) which never gets called; I used it during development & figured it might come in handy when expanding this program, so I left it in. :)
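Since the question asked for the standard library and for a fallback when the remote URL is unavailable, here is a rough Python 3 sketch of a get_file_from_cache along those lines. It is not the code above: fetch_http, the fetch parameter and the choice to refresh whenever the server sends no Last-Modified header are all assumptions made for this example, and for brevity it always downloads the body instead of sending a HEAD request first.

```python
import email.utils
import os
import urllib.request


def fetch_http(url):
    """Return (body_bytes, last_modified_timestamp_or_None) for url."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        header = resp.headers.get("Last-Modified")
    parsed = email.utils.parsedate_tz(header) if header else None
    mtime = email.utils.mktime_tz(parsed) if parsed else None
    return body, mtime


def get_file_from_cache(filename, canonical, age=0, fetch=fetch_http, mode="rb"):
    """Open filename, refreshing it from canonical first when it is stale."""
    try:
        body, remote_mtime = fetch(canonical)
    except OSError:
        # Remote unavailable (urllib's URLError is an OSError): fall back to
        # the cached copy if there is one, otherwise let the error propagate.
        if os.path.exists(filename):
            return open(filename, mode)
        raise
    stale = (
        not os.path.exists(filename)
        or remote_mtime is None  # nothing to compare against: refresh
        or remote_mtime > os.path.getmtime(filename) + age
    )
    if stale:
        parent = os.path.dirname(filename)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(filename, "wb") as f:
            f.write(body)
        if remote_mtime is not None:
            # Give the local file the remote timestamp for later comparisons.
            os.utime(filename, (remote_mtime, remote_mtime))
    return open(filename, mode)
```

Passing the fetcher in as a parameter keeps the caching logic independent of HTTP, so another protocol only needs a different callable that returns the same pair of values.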
I am using Django-media-tree to import images into a site image library. I am hitting a bug in PIL where some unknown EXIF data on the image is causing a non-handled exception in the generation of the thumbnail images. Rather than hacking around in PIL I am looking to simply remove all EXIF data from the image before it is handled by PIL.
Using chilkat.CkXmp() I am attempting to rewrite the image to a new directory in clean form, however the RemoveAllEmbedded() method returns None, and the image is rewritten with the EXIF data intact.
import os
import sys
import chilkat
ALLOWED_EXTENSIONS = ['.jpg', 'jpeg', '.png', '.gif', 'tiff']
def listdir_fullpath(d):
    list = []
    for f in os.listdir(d):
        if len(f) > 3:
            if f[-4:] in ALLOWED_EXTENSIONS:
                list.append(os.path.join(d, f))
    return list
def trim_xmp_data(file, dir):
    xmp = chilkat.CkXmp()
    success = xmp.UnlockComponent("Anything for 30-day trial.")
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()
    success = xmp.LoadAppFile(file)
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()
    print "Num embedded XMP docs: %d" % xmp.get_NumEmbedded()
    xmp.RemoveAllEmbedded()
    # Save the JPG.
    fn = "%s/amended/%s" % (dir, file.rsplit('/')[-1])
    success = xmp.SaveAppFile(fn)
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()
for item in listdir_fullpath('/Users/harrin2/Desktop/tmp/'):
    trim_xmp_data(item, '/Users/harrin2/Desktop/tmp')
Can anyone tell me where I am going wrong? If there is a better method of cleaning the images, I am open to suggestions.
TIA