Is there a way to restart a download if it gets stuck at XX%? I'm scraping and downloading quite a lot of files with the code below. It handles connection errors, but it won't restart a download that gets stuck.
for element in elements:
    for attempt in range(100):
        try:
            wget.download(element.get_attribute("href"), path)
        except:
            print("attempt error, retry" + str(attempt))
        else:
            break
There seems to be no feature to restart the download. I looked at many examples of this package -> https://www.programcreek.com/python/example/83386/wget.download. The page for the manual is gone, and the pypi.org page does not have any info about a feature like this.
However, you can restart the download by retrying it in a loop whenever it raises an exception. This code should work for you.
for element in elements:
    # Reset the flag and the counter for every file
    # The download loop exits after 5 failed attempts
    downloaded = False
    attempts = 0
    while not downloaded and attempts < 5:
        try:
            wget.download(element.get_attribute("href"), path)
            # Set downloaded flag to end loop
            downloaded = True
        except Exception:
            attempts += 1
            print("attempt error, retry " + str(attempts))
Another approach is to use the requests library, which is more popular.
import requests

def proceed_files():
    file_urls = ['list', 'of', 'file urls']
    for url in file_urls:
        # Set some variables to end loop after download success
        # The download loop will exit if failed 5 times
        downloaded = False
        attempts = 0
        while not downloaded and attempts < 5:
            if download_file(url):
                downloaded = True
            else:
                attempts += 1

def download_file(url):
    try:
        # timeout (in seconds, example value) makes a stalled connection raise instead of hanging
        request = requests.get(url, allow_redirects=True, timeout=30)
        request.raise_for_status()
        # use the last path segment as the file name
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(request.content)
        return True
    except (requests.RequestException, OSError):
        return False
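The retry logic in both answers can be separated from the download call itself. Below is a minimal sketch of such a helper; the name `retry` and its parameters are made up for illustration and are not part of wget or requests. A download that is merely stuck never raises, so the helper only helps when the wrapped call has a timeout (requests accepts a `timeout` argument for this).

```python
import time


def retry(action, attempts=5, delay=0.0):
    """Call action() until it succeeds or `attempts` (>= 1) tries are used up.

    Returns whatever action() returns; re-raises the last error otherwise.
    `retry` is a hypothetical helper, not an API of wget or requests.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # deliberately broad for a sketch
            last_error = error
            print("attempt %d failed: %s" % (attempt, error))
            time.sleep(delay)
    raise last_error


# Possible usage (assumes requests is installed and url is defined):
#   response = retry(lambda: requests.get(url, timeout=(5, 30)))
```

Keeping the loop in one place means the attempt counter is reset for every file automatically, because each call to `retry` starts from scratch.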
I am trying to download HTML pages with my Python script and a TOR proxy server. It runs, but it is extremely slow, and the code is poorly organized, so it spends most of its time renewing the IP rather than downloading pages. How can I speed up downloading with TOR? How can I organize the code more efficiently?
There are two scripts. Script1 downloads HTML pages from the website; once the website blocks it, Script2 has to be executed to renew the IP with the help of the TOR proxy, and so on. The IP gets blocked after a few seconds.
Should I lower my threading? How? Please help me speed up the process. I am getting only 300-500 HTML pages per hour.
Here is my Full Code of Script1:
# -*- coding: UTF-8 -*-
import os
import sys
import socks
import socket
import subprocess
import time
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, '127.0.0.1', 9050, True)
socket.socket = socks.socksocket
import urllib2

class WebPage:
    def __init__(self, path, country, url, lower=0,upper=9999):
        self.dir = str(path)+"/"+ str(country)
        self.dir =os.path.join(str(path),str(country))
        self.url = url
        try:
            fin = open(self.dir+"/limit.txt",'r')
            limit = fin.readline()
            limits = str(limit).split(",")
            lower = int(limits[0])
            upper = int(limits[1])
            fin.close()
        except:
            fout = open(self.dir+"/limit.txt",'wb')
            limits = str(lower)+","+str(upper)
            fout.write(limits)
            fout.close()
        self.process_instances(lower,upper)

    def process_instances(self,lower,upper):
        try:
            os.stat(self.dir)
        except:
            os.mkdir(self.dir)
        for count in range(lower,upper+1):
            if count == upper:
                print "all downloaded, quitting the app!!"
                break
            targetURL = self.url+"/"+str(count)
            print "Downloading :" + targetURL
            req = urllib2.Request(targetURL)
            try:
                response = urllib2.urlopen(req)
                the_page = response.read()
                if the_page.find("Your IP suspended")>=0:
                    print "The IP is suspended"
                    fout = open(self.dir+"/limit.txt",'wb')
                    limits = str(count)+","+str(upper)
                    fout.write(limits)
                    fout.close()
                    break
                if the_page.find("Too many requests")>=0:
                    print "Too many requests"
                    print "Renew IP...."
                    fout = open(self.dir+"/limit.txt",'wb')
                    limits = str(count)+","+str(upper)
                    fout.write(limits)
                    fout.close()
                    subprocess.Popen("C:\Users\John\Desktop\Data-Mine\yp\lol\lol2.py", shell=True)
                    time.sleep(2)
                    subprocess.call('lol1.py')
                if the_page.find("404 error")>=0:
                    print "the page not exist"
                    continue
                self.saveHTML(count, the_page)
            except:
                print "The URL cannot be fetched"
                execfile('lol1.py')
                pass
                #continue
                raise

    def saveHTML(self,count, content):
        fout = open(self.dir+"/"+str(count)+".html",'wb')
        fout.write(content)
        fout.close()

if __name__ == '__main__':
    if len(sys.argv) !=6:
        print "cannot process!!! Five Parameters are required to run the process."
        print "Parameter 1 should be the path where to save the data, eg, /Users/john/data/"
        print "Parameter 2 should be the name of the country for which data is collected, eg, japan"
        print "Parameter 3 should be the URL from which the data to collect, eg, the website link"
        print "Parameter 4 should be the lower limit of the company id, eg, 11 "
        print "Parameter 5 should be the upper limit of the company id, eg, 1000 "
        print "The output will be saved as the HTML file for each company in the target folder's country"
        exit()
    else:
        path = str(sys.argv[1])
        country = str(sys.argv[2])
        url = str(sys.argv[3])
        lowerlimit = int(sys.argv[4])
        upperlimit = int(sys.argv[5])
        WebPage(path, country, url, lowerlimit,upperlimit)
TOR is very slow, so it is to be expected that you don't get that many pages per hour. There are, however, some ways to speed it up. Most notably, you could turn on GZIP compression for urllib (see this question for example) to improve the speed a little.
TOR has rather low bandwidth as a protocol, because the data needs to be relayed a few times and each relay must spend its own bandwidth on your request. If data is relayed 6 times (a rather probable number), the network needs 6 times the bandwidth. GZIP compression can shrink HTML to, in some cases, ~10% of its original size, so it will probably speed up the process.
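As a rough illustration of the GZIP suggestion, here is a minimal sketch written for Python 3's `urllib.request` rather than the `urllib2` used in the question. The function names `decode_body` and `fetch_compressed` are invented for this example, and the SOCKS proxy setup from the question is left out.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Gunzip raw bytes if the server says the body is gzip-encoded."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch_compressed(url, timeout=30):
    """Ask the server for a gzip-compressed response and return the decoded body."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

The server is free to ignore the `Accept-Encoding` header, which is why the decoding step checks `Content-Encoding` instead of assuming the body is compressed.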
I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at roughly 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of a file (already covered for Python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://www.python.org/'
chunk_size = 1 << 16  # bytes per range request (example value)
parts = {}

def download(url, start):
    req = urllib2.Request(url)
    # Range bounds are inclusive, so the last byte is start + chunk_size - 1
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)
# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()
# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (servers where PHP scripts are responsible for serving the data, in particular, often don't implement it).
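When the total size is known up front (for example from a `Content-Length` header), the ranges for the threads can be computed before any download starts. The sketch below shows one way to do that; `byte_ranges` and `range_header` are hypothetical helper names, not part of urllib2 or requests.

```python
def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield start, end


def range_header(start, end):
    """Build the HTTP header dict that requests bytes start..end (inclusive)."""
    return {"Range": "bytes=%d-%d" % (start, end)}


for start, end in byte_ranges(10, 4):
    print(range_header(start, end))
```

A server that honors the header answers with status 206 (Partial Content); a plain 200 means it ignored the range and is sending the whole file.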
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4) # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of the file is detected if the server returns an empty body or a 416 HTTP code, or if the response size is not exactly chunksize.
It supports servers that don't understand the Range header (everything is downloaded in a single request in this case; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).
It lets you change the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request independently.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the http directory list for location MY_HTTP_LOC. I tested this script on an instance of lighttpd/1.4.26 http server. But, you can easily modify this script so that it works for other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess

MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"

# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = r'(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp,str(page))
files = []
for match in results:
    files.append(match[1])

# download (using aria2c) files
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile+'.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC+str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads and can do a lot more; it also shows a download bar by default.
For more info, check this answer.
How can I modify my script to skip a URL if the connection times out or is invalid/404?
#!/usr/bin/python
#parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")
To apply a timeout to your request, add the timeout argument to your call to urlopen. From the docs:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
Refer to this guide's section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.
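The HTTP status code for a request timeout is 408. Note that a client-side timeout set through the timeout argument surfaces instead as a URLError (or socket.timeout) with no code attribute. Wrapping it up, to handle timeouts you would:
try:
    # pass timeout by keyword; the second positional argument of urlopen is data
    response = urlopen(req, timeout=3) # 3 seconds
except URLError, e:
    if hasattr(e, 'code'):
        if e.code==408:
            print 'Timeout ', e.code
        if e.code==404:
            print 'File Not Found ', e.code
        # etc etc
    else:
        # no HTTP status: connection failure or client-side timeout
        print 'Failed to reach the server ', e.reason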
Try putting your urlopen line inside a try/except statement. Look this up:
docs.python.org/tutorial/errors.html section 8.3
Look at the different exceptions, and when you encounter one, just move on to the next URL using the continue statement.
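Putting the two answers together, a skip-on-failure fetch could look like the sketch below. It is written for Python 3's `urllib.request` instead of the question's `urllib2`, and `fetch_or_skip` with its `opener` parameter is a made-up helper; the parameter only exists so the function can be exercised without a network.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=3, opener=urllib.request.urlopen):
    """Return the body of url, or None when the URL should be skipped."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status such as 404
        print("Skipping %s: HTTP %d" % (url, e.code))
    except (urllib.error.URLError, socket.timeout, ValueError) as e:
        # Unreachable host, timed-out connection, or malformed URL
        print("Skipping %s: %s" % (url, e))
    return None


# In the download loop this becomes:
#   body = fetch_or_skip(url)
#   if body is None:
#       continue
```

`HTTPError` is caught first because it is a subclass of `URLError`; with the order reversed the status code branch would never run.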
What I want to achieve is to get a website screenshot from any website in python.
Env: Linux
Here is a simple solution using webkit:
http://webscraping.com/blog/Webpage-screenshots-with-webkit/
import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print 'saving', output_file
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

s = Screenshot()
s.capture('http://webscraping.com', 'website.png')
s.capture('http://webscraping.com/blog', 'blog.png')
Here is my solution, put together with help from various sources. It takes a full web page screen capture, optionally crops it, and generates a thumbnail from the cropped image.
Requirements:
Install NodeJS
Using Node's package manager install phantomjs: npm -g install phantomjs
Install selenium (in your virtualenv, if you are using that)
Install imageMagick
Add phantomjs to system path (on windows)
import os
from subprocess import Popen, PIPE
from selenium import webdriver

abspath = lambda *p: os.path.abspath(os.path.join(*p))
ROOT = abspath(os.path.dirname(__file__))

def execute_command(command):
    result = Popen(command, shell=True, stdout=PIPE).stdout.read()
    if len(result) > 0 and not result.isspace():
        raise Exception(result)

def do_screen_capturing(url, screen_path, width, height):
    print "Capturing screen.."
    driver = webdriver.PhantomJS()
    # it save service log file in same directory
    # if you want to have log file stored else where
    # initialize the webdriver.PhantomJS() as
    # driver = webdriver.PhantomJS(service_log_path='/var/log/phantomjs/ghostdriver.log')
    driver.set_script_timeout(30)
    if width and height:
        driver.set_window_size(width, height)
    driver.get(url)
    driver.save_screenshot(screen_path)

def do_crop(params):
    print "Cropping captured image.."
    command = [
        'convert',
        params['screen_path'],
        '-crop', '%sx%s+0+0' % (params['width'], params['height']),
        params['crop_path']
    ]
    execute_command(' '.join(command))

def do_thumbnail(params):
    print "Generating thumbnail from cropped captured image.."
    command = [
        'convert',
        params['crop_path'],
        '-filter', 'Lanczos',
        '-thumbnail', '%sx%s' % (params['width'], params['height']),
        params['thumbnail_path']
    ]
    execute_command(' '.join(command))

def get_screen_shot(**kwargs):
    url = kwargs['url']
    width = int(kwargs.get('width', 1024)) # screen width to capture
    height = int(kwargs.get('height', 768)) # screen height to capture
    filename = kwargs.get('filename', 'screen.png') # file name e.g. screen.png
    path = kwargs.get('path', ROOT) # directory path to store screen
    crop = kwargs.get('crop', False) # crop the captured screen
    crop_width = int(kwargs.get('crop_width', width)) # the width of crop screen
    crop_height = int(kwargs.get('crop_height', height)) # the height of crop screen
    crop_replace = kwargs.get('crop_replace', False) # does crop image replace original screen capture?
    thumbnail = kwargs.get('thumbnail', False) # generate thumbnail from screen, requires crop=True
    thumbnail_width = int(kwargs.get('thumbnail_width', width)) # the width of thumbnail
    thumbnail_height = int(kwargs.get('thumbnail_height', height)) # the height of thumbnail
    thumbnail_replace = kwargs.get('thumbnail_replace', False) # does thumbnail image replace crop image?
    screen_path = abspath(path, filename)
    crop_path = thumbnail_path = screen_path
    if thumbnail and not crop:
        raise Exception, 'Thumbnail generation requires crop image, set crop=True'
    do_screen_capturing(url, screen_path, width, height)
    if crop:
        if not crop_replace:
            crop_path = abspath(path, 'crop_'+filename)
        params = {
            'width': crop_width, 'height': crop_height,
            'crop_path': crop_path, 'screen_path': screen_path}
        do_crop(params)
        if thumbnail:
            if not thumbnail_replace:
                thumbnail_path = abspath(path, 'thumbnail_'+filename)
            params = {
                'width': thumbnail_width, 'height': thumbnail_height,
                'thumbnail_path': thumbnail_path, 'crop_path': crop_path}
            do_thumbnail(params)
    return screen_path, crop_path, thumbnail_path

if __name__ == '__main__':
    '''
    Requirements:
    Install NodeJS
    Using Node's package manager install phantomjs: npm -g install phantomjs
    install selenium (in your virtualenv, if you are using that)
    install imageMagick
    add phantomjs to system path (on windows)
    '''
    url = 'http://stackoverflow.com/questions/1197172/how-can-i-take-a-screenshot-image-of-a-website-using-python'
    screen_path, crop_path, thumbnail_path = get_screen_shot(
        url=url, filename='sof.png',
        crop=True, crop_replace=False,
        thumbnail=True, thumbnail_replace=False,
        thumbnail_width=200, thumbnail_height=150,
    )
These are the generated images:
Full web page screen
Cropped image from captured screen
Thumbnail of a cropped image
You can do this using Selenium:
from selenium import webdriver
DRIVER = 'chromedriver'
driver = webdriver.Chrome(DRIVER)
driver.get('https://www.spotify.com')
screenshot = driver.save_screenshot('my_screenshot.png')
driver.quit()
https://sites.google.com/a/chromium.org/chromedriver/getting-started
On the Mac, there's webkit2png and on Linux+KDE, you can use khtml2png. I've tried the former and it works quite well, and heard of the latter being put to use.
I recently came across QtWebKit which claims to be cross platform (Qt rolled WebKit into their library, I guess). But I've never tried it, so I can't tell you much more.
The QtWebKit links shows how to access from Python. You should be able to at least use subprocess to do the same with the others.
11 years later...
Taking a website screenshot using Python 3.6 and the Google PageSpeed Insights API v5:
import base64
import requests
import traceback
import urllib.parse as ul
# It's possible to make requests without the api key, but the number of requests is very limited
url = "https://duckgo.com"
urle = ul.quote_plus(url)
image_path = "duckgo.jpg"
key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
strategy = "desktop" # "mobile"
u = f"https://www.googleapis.com/pagespeedonline/v5/runPagespeed?key={key}&strategy={strategy}&url={urle}"
try:
    j = requests.get(u).json()
    ss_encoded = j['lighthouseResult']['audits']['final-screenshot']['details']['data'].replace("data:image/jpeg;base64,", "")
    ss_decoded = base64.b64decode(ss_encoded)
    with open(image_path, 'wb+') as f:
        f.write(ss_decoded)
except :
    print(traceback.format_exc())
    exit(1)
Notes:
Live Demo
Pros: Free
Cons: Low Resolution
Get API Key
Docs
Limits:
Queries per day = 25,000
Queries per 100 seconds = 400
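The screenshot in the response above arrives as a base64 data URI, and the snippet strips a hard-coded JPEG prefix before decoding. A slightly more general decoding step might look like this sketch; `decode_data_uri` is a hypothetical helper name, and it assumes the value always has the form `data:<mime type>;base64,<payload>`.

```python
import base64


def decode_data_uri(data_uri):
    """Return the raw bytes stored in a base64 data URI."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# Round trip with two bytes standing in for image data
sample = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8").decode("ascii")
print(decode_data_uri(sample))
```

Splitting on the first comma keeps the function independent of the image type, so it would still work if the API returned a different MIME type.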
Using Rendertron is an option. Under the hood, this is a headless Chrome exposing the following endpoints:
/render/:url: Access this route e.g. with requests.get if you are interested in the DOM.
/screenshot/:url: Access this route if you are interested in a screenshot.
You would install rendertron with npm, run rendertron in one terminal, access http://localhost:3000/screenshot/:url and save the file, but a demo is available at render-tron.appspot.com making it possible to run this Python3 snippet locally without installing the npm package:
import requests
BASE = 'https://render-tron.appspot.com/screenshot/'
url = 'https://google.com'
path = 'target.jpg'
response = requests.get(BASE + url, stream=True)
# save file, see https://stackoverflow.com/a/13137873/7665691
if response.status_code == 200:
    with open(path, 'wb') as file:
        for chunk in response:
            file.write(chunk)
I can't comment on ars's answer, but I actually got Roland Tapken's code running using QtWebkit and it works quite well.
Just wanted to confirm that what Roland posts on his blog works great on Ubuntu. Our production version ended up not using any of what he wrote but we are using the PyQt/QtWebKit bindings with much success.
Note: The URL used to be: http://www.blogs.uni-osnabrueck.de/rotapken/2008/12/03/create-screenshots-of-a-web-page-using-python-and-qtwebkit/ I've updated it with a working copy.
This is an old question and most answers are a bit dated.
Currently, I would do 1 of 2 things.
1. Create a program that takes the screenshots
I would use Pyppeteer to take screenshots of websites. This runs on the Puppeteer package. Puppeteer spins up a headless chrome browser, so the screenshots will look exactly like they would in a normal browser.
This is taken from the pyppeteer documentation:
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
2. Use a screenshot API
You could also use a screenshot API such as this one.
The nice thing is that you don't have to set everything up yourself but can simply call an API endpoint.
This is taken from the screenshot API's documentation:
import urllib.parse
import urllib.request
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# The parameters.
token = "YOUR_API_TOKEN"
url = urllib.parse.quote_plus("https://example.com")
width = 1920
height = 1080
output = "image"
# Create the query URL.
query = "https://screenshotapi.net/api/v1/screenshot"
query += "?token=%s&url=%s&width=%d&height=%d&output=%s" % (token, url, width, height, output)
# Call the API.
urllib.request.urlretrieve(query, "./example.png")
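The query string in the snippet above is assembled by hand with `%` formatting. As an alternative sketch, `urllib.parse.urlencode` can build and escape it in one step. The endpoint and parameter names are copied from the snippet; `build_query` itself is a made-up helper, and the example does not call the service.

```python
from urllib.parse import urlencode

# Endpoint taken from the snippet above
API_ENDPOINT = "https://screenshotapi.net/api/v1/screenshot"


def build_query(token, target_url, width=1920, height=1080, output="image"):
    """Return the full request URL with every parameter percent-encoded."""
    params = {
        "token": token,
        "url": target_url,
        "width": width,
        "height": height,
        "output": output,
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_query("YOUR_API_TOKEN", "https://example.com"))
```

Because `urlencode` escapes every value, the target URL is passed in unescaped and no separate `quote_plus` call is needed.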
You can use the web service s-shot.ru. Because it is a remote service it's not so fast, but it is quite easy to configure what you need through the link parameters.
You can also easily capture full-page screenshots:
import requests
import urllib.parse
BASE = 'https://mini.s-shot.ru/1024x0/JPEG/1024/Z100/?' # you can modify size, format, zoom
url = 'https://stackoverflow.com/'#or whatever link you need
url = urllib.parse.quote_plus(url) #service needs link to be joined in encoded format
print(url)
path = 'target1.jpg'
response = requests.get(BASE + url, stream=True)
if response.status_code == 200:
    with open(path, 'wb') as file:
        for chunk in response:
            file.write(chunk)
You can use the Google Page Speed API to achieve your task easily. In my current project, I have used a Google Page Speed API query written in Python to capture screenshots of any web URL provided and save them to a location. Have a look.
import urllib2
import json
import base64
import sys
import os
import errno

# The website's URL as an Input
site = sys.argv[1]
imagePath = sys.argv[2]
# The Google API. Remove "&strategy=mobile" for a desktop screenshot
api = "https://www.googleapis.com/pagespeedonline/v1/runPagespeed?screenshot=true&strategy=mobile&url=" + urllib2.quote(site)
# Get the results from Google
try:
    site_data = json.load(urllib2.urlopen(api))
except urllib2.URLError:
    print "Unable to retrieve data"
    sys.exit()
except ValueError:
    print "Invalid JSON encountered."
    sys.exit()
try:
    screenshot_encoded = site_data['screenshot']['data']
except KeyError:
    print "No screenshot data in the response."
    sys.exit()
# Google has a weird way of encoding the Base64 data
screenshot_encoded = screenshot_encoded.replace("_", "/")
screenshot_encoded = screenshot_encoded.replace("-", "+")
# Decode the Base64 data
screenshot_decoded = base64.b64decode(screenshot_encoded)
# Create the target directory if it does not exist yet
directory = os.path.dirname(imagePath)
if directory and not os.path.exists(directory):
    try:
        os.makedirs(directory)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
# Save the file (binary mode, since the data is an image)
with open(imagePath, 'wb') as file_:
    file_.write(screenshot_decoded)
Unfortunately, it has the following drawbacks. If these do not matter, you can proceed with the Google Page Speed API. It works well.
The maximum width is 320px
According to Google API Quota, there is a limit of 25,000 requests per day
You don't mention what environment you're running in, which makes a big difference because there isn't a pure Python web browser that's capable of rendering HTML.
But if you're using a Mac, I've used webkit2png with great success. If not, as others have pointed out there are plenty of options.
I created a library called pywebcapture that wraps Selenium and does just that:
pip install pywebcapture
Once you install with pip, you can do the following to easily get full size screenshots:
# import modules
from pywebcapture import loader, driver
# load csv with urls
csv_file = loader.CSVLoader("csv_file_with_urls.csv", has_header_bool, url_column, optional_filename_column)
uri_dict = csv_file.get_uri_dict()
# create instance of the driver and run
d = driver.Driver("path/to/webdriver/", output_filepath, delay, uri_dict)
d.run()
Enjoy!
https://pypi.org/project/pywebcapture/
Try this..
#!/usr/bin/env python
import gtk.gdk
import time
import random

while 1 :
    # generate a random time between 120 and 300 sec
    random_time = random.randrange(120,300)
    # wait between 120 and 300 seconds (or between 2 and 5 minutes)
    print "Next picture in: %.2f minutes" % (float(random_time) / 60)
    time.sleep(random_time)
    w = gtk.gdk.get_default_root_window()
    sz = w.get_size()
    print "The size of the window is %d x %d" % sz
    pb = gtk.gdk.Pixbuf(gtk.gdk.COLORSPACE_RGB,False,8,sz[0],sz[1])
    pb = pb.get_from_drawable(w,w.get_colormap(),0,0,0,0,sz[0],sz[1])
    ts = time.time()
    filename = "screenshot"
    filename += str(ts)
    filename += ".png"
    if (pb != None):
        pb.save(filename,"png")
        print "Screenshot saved to "+filename
    else:
        print "Unable to get the screenshot."
import subprocess

def screenshots(url, name):
    subprocess.run('webkit2png -F -o {} {} -D ./screens'.format(name, url),
                   shell=True)