Error Web Scraping with ThreadPoolExecutor - python

I'm working on a simple web scraping program, but I can't even seem to download a simple set of pages and get their sizes.
Here is my code:
from concurrent.futures import ThreadPoolExecutor as Executor

urls = """reddit twitter tumblr instagram linkedin""".split()

def fetch(url):
    from urllib import request, error
    try:
        data = request.urlopen(url).read()
        return '{}: length {}'.format(url, len(data))
    except error.HTTPError as e:
        return '{}: {}'.format(url, e)

with Executor(max_workers=4) as exe:
    template = 'http://www.{}.com'
    jobs = [exe.submit(
        fetch, template.format(u)) for u in urls]
    results = [job.result() for job in jobs]
    print('\n'.join(results))
In the command line I'm running
python scrape.py
but I'm getting the error
Traceback (most recent call last):
File "scrape.py", line 1, in
from concurrent.futures import ThreadPoolExecutor as Executor
ImportError: No module named concurrent.futures
What do I need to do to surmount this error?

Use Python 3. The concurrent.futures module was added to the standard library in Python 3.2 (https://docs.python.org/3/library/concurrent.futures.html), so running the script with Python 2 produces exactly this ImportError.
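If you want the failure to be more obvious than a bare ImportError, here is a minimal sketch (nothing beyond the standard library, and only an illustration) that checks the interpreter version before importing:

import sys

# concurrent.futures entered the standard library in Python 3.2,
# so bail out early on older interpreters instead of failing on the import.
if sys.version_info < (3, 2):
    sys.exit("This script needs Python 3.2+; run it with: python3 scrape.py")

from concurrent.futures import ThreadPoolExecutor as Executor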

Related

Why does concurrent.futures executor map throw an error when used with futures.as_completed after all the futures are complete?

I am trying to send HTTP requests concurrently. In order to do so, I am using concurrent.futures. Here is some simple code:
import requests
from concurrent import futures

data = range(10)

def send_request(item):
    requests.get("https://httpbin.org/ip")
    print("Request {} complete.".format(item))

executor = futures.ThreadPoolExecutor(max_workers=25)
futures_ = executor.map(send_request, data)

for f in futures.as_completed(futures_):
    f.result()
If I run it, I can see that the requests are sent asynchronously, which is exactly what I want. However, when all the requests are complete, I get the following error:
Request 0 complete.
Request 6 complete.
...
Request 7 complete.
Request 9 complete.
Request 3 complete.
Traceback (most recent call last):
File "send_thread.py", line 18, in <module>
for f in futures.as_completed(futures_):
File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 219, in as_completed
with _AcquireFutures(fs):
File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 146, in __enter__
future._condition.acquire()
AttributeError: 'NoneType' object has no attribute '_condition'
This is quite a strange error. Here executor.map seems to be the problem. If I replace map with the following line, it works as expected.
futures_ = [executor.submit(send_request, x) for x in data]
What am I missing? Tried to find difference between two, but can't seem to understand what could cause above issue. Any input would be highly appreciated.
Executor.map does not return a list of futures but a generator of results, so instead of:

futures_ = executor.map(send_request, data)
for f in futures.as_completed(futures_):
    f.result()

you should run:

results = executor.map(send_request, data)
for r in results:
    print(r)
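If you specifically want Future objects to feed into futures.as_completed, then submit is the way to go, as you already noticed. A minimal sketch, restating your send_request and data for completeness:

import requests
from concurrent import futures

data = range(10)

def send_request(item):
    requests.get("https://httpbin.org/ip")
    print("Request {} complete.".format(item))

with futures.ThreadPoolExecutor(max_workers=25) as executor:
    futures_ = [executor.submit(send_request, x) for x in data]
    for f in futures.as_completed(futures_):
        f.result()  # re-raises any exception that send_request raised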

How do I get the response text from a treq request?

I am trying to get started with some example code of the treq library, to little avail. While it is easy to get the status code and a few other properties of the response to the request, getting the actual text of the response is a little more difficult. The print_response function available in this example code is not present in the version that I have:
from twisted.internet.task import react
from _utils import print_response
import treq

def main(reactor, *args):
    d = treq.get('http://httpbin.org/get')
    d.addCallback(print_response)
    return d

react(main, [])
Here is the traceback:
Traceback (most recent call last):
File "test.py", line 2, in <module>
from _utils import print_response
ModuleNotFoundError: No module named '_utils'
I am not really sure where to go from here...any help would be greatly appreciated.
Now that I look at it, that example is extremely bad, especially if you're new to Twisted. Please give this a try:
import treq
from twisted.internet import defer, task

def main(reactor):
    d = treq.get('https://httpbin.org/get')
    d.addCallback(print_results)
    return d

@defer.inlineCallbacks
def print_results(response):
    content = yield response.content()
    text = yield response.text()
    json = yield response.json()

    print(content)  # raw content in bytes
    print(text)     # encoded text
    print(json)     # JSON

task.react(main)
The only thing you really have to know is that .content(), .text() and .json() return Deferred objects that eventually fire with the body of the response. For this reason, you need to yield them or attach callbacks.
Let's say you only want the text content; you could do this:
def main(reactor):
    d = treq.get('https://httpbin.org/get')
    d.addCallback(treq.text_content)
    d.addCallback(print)  # replace print with your own callback function
    return d
The treq.content() family of functions makes it easy to return only the content, if that's all you care about, and do things with it.
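As a variation on the same idea (a sketch, reusing the httpbin URL from above), the JSON helper can be chained the same way:

import treq
from twisted.internet import task

def main(reactor):
    d = treq.get('https://httpbin.org/get')
    d.addCallback(treq.json_content)  # parses the body and fires with the decoded JSON
    d.addCallback(print)              # replace print with your own callback
    return d

task.react(main)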

NameError in function to retrieve JSON data

I'm using python 3.6.1 and have the following code which successfully retrieves data in JSON format:
import urllib.request,json,pprint
url = "https://someurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
pprint.pprint(data)
I want to wrap this in a function, so I can reuse it. This is what I have tried in a file called getdata.py:
from urllib.request import urlopen
import json

def get_json_data(url):
    response = urlopen(url)
    return json.loads(response.read())
and this is the error I get after importing the file and attempting to print out the response:
>>> import getdata
>>> print(getdata.get_json_data("https://someurl"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Nick\getdata.py", line 6, in get_json_data
from urllib.request import urlopen
NameError: name 'urllib' is not defined
I also tried this and got the same error:
import urllib.request, json

def get_json_data(url):
    response = urllib.request.urlopen(url)
    return json.loads(response.read())
What do I need to do to get this to work, please?
Cheers
It's working now! I think the problem was the Hydrogen add-on I have for the Atom editor. I uninstalled it, tried again and it worked. Thanks for looking.
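For anyone who hits the same NameError in a long-lived REPL or editor session: edits to getdata.py are not picked up by an interpreter that has already imported it, so reloading the module (a sketch, reusing the placeholder URL from the question and not specific to Hydrogen) helps rule out a stale import:

import importlib
import getdata

importlib.reload(getdata)  # re-executes getdata.py so new imports and definitions take effect
print(getdata.get_json_data("https://someurl"))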

Why does this python script work on my local machine but not on Heroku?

Hi there. I'm building a simple scraping tool. Here's the code that I have for it.
from bs4 import BeautifulSoup
import requests
from lxml import html
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('Programming 4 Marketers-File-goes-here.json', scope)
site = 'http://nathanbarry.com/authority/'
hdr = {'User-Agent':'Mozilla/5.0'}
req = requests.get(site, headers=hdr)
soup = BeautifulSoup(req.content)
def getFullPrice(soup):
    divs = soup.find_all('div', id='complete-package')
    price = ""
    for i in divs:
        price = i.a
    completePrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return completePrice

def getVideoPrice(soup):
    divs = soup.find_all('div', id='video-package')
    price = ""
    for i in divs:
        price = i.a
    videoPrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return videoPrice
fullPrice = getFullPrice(soup)
videoPrice = getVideoPrice(soup)
date = datetime.date.today()
gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1
row = len(wks.col_values(1))+1
wks.update_cell(row, 1, date)
wks.update_cell(row, 2, fullPrice)
wks.update_cell(row, 3, videoPrice)
This script runs on my local machine. But, when I deploy it as a part of an app to Heroku and try to run it, I get the following error:
Traceback (most recent call last):
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 219, in put_feed
r = self.session.put(url, data, headers=headers)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 82, in put
return self.request('PUT', url, params=params, data=data, **kwargs)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 69, in request
response.status_code, response.content))
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "AuthorityScraper.py", line 44, in
wks.update_cell(row, 1, date)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/models.py", line 517, in update_cell
self.client.put_feed(uri, ElementTree.tostring(feed))
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 221, in put_feed
if ex[0] == 403:
TypeError: 'RequestError' object does not support indexing
What do you think might be causing this error? Do you have any suggestions for how I can fix it?
There are a couple of things going on:
1) The Google Sheets API returned an error: "Invalid query parameter value for cell_id":
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
2) A bug in gspread caused an exception upon receipt of the error:
TypeError: 'RequestError' object does not support indexing
Python 3 removed __getitem__ from BaseException, which this gspread error handling relies on. This doesn't matter too much, because it would have raised an UpdateCellError exception anyway.
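For reference, the Python 3 way to inspect an exception's arguments is ex.args rather than indexing the exception itself; roughly, the check the library intended looks like this sketch:

# Python 2 allowed ex[0]; in Python 3 the constructor arguments live in ex.args.
try:
    raise RuntimeError(403, "Forbidden")
except RuntimeError as ex:
    if ex.args[0] == 403:
        print("got a 403")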
My guess is that you are passing an invalid row number to update_cell. It would be helpful to add some debug logging to your script to show, for example, which row it is trying to update.
It may be better to start with a worksheet with zero rows and use append_row instead. However, there does seem to be an outstanding issue in gspread with append_row, and it may actually be the same issue you are running into.
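A minimal sketch of both suggestions, reusing the credentials, fullPrice and videoPrice variables from the question's script; the print is purely illustrative debug output:

import datetime
import gspread

gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1

row = len(wks.col_values(1)) + 1
print("About to write to row {}".format(row))  # debug: is this a sensible row number?

# Alternative: let gspread compute the next row itself.
wks.append_row([str(datetime.date.today()), fullPrice, videoPrice])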
I encountered the same problem. BS4 works fine on my local machine, but for some reason it is far too slow on the Heroku server, which results in an error.
I switched to lxml and it is working fine now.
Install it with the command:
pip install lxml
A sample code snippet is given below:
from lxml import html
import requests

getpage = requests.get("https://url_here")
gethtmlcontent = html.fromstring(getpage.content)
data = gethtmlcontent.xpath('//div[@class = "class-name"]/text()')
# this is a sample for fetching data from the dummy div
data = data[0:n]  # as per your requirement
# now inject the data into the Django template.

example urllib3 and threading in python

I am trying to use urllib3 with simple threading to fetch several wiki pages.
The script will create one connection for every thread (I don't understand why) and hang forever.
Any tip, advice or simple example of urllib3 and threading is appreciated.
import threadpool
from urllib3 import connection_from_url

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

def fetch(url, fiedls):
    kwargs = {'retries': 6}
    return HTTP_POOL.get_url(url, fields, **kwargs)

pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
@Lennart's script got this error:
http://en.wikipedia.org/wiki/2010-11_Premier_LeagueTraceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
http://en.wikipedia.org/wiki/List_of_MythBusters_episodeshttp://en.wikipedia.org/wiki/List_of_Top_Gear_episodes http://en.wikipedia.org/wiki/List_of_Unicode_characters result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
After adding import threadpool, import urllib3 and tpool = threadpool.ThreadPool(4), @user318904's code got this error:
Traceback (most recent call last):
File "crawler.py", line 21, in <module>
tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'
Here is my take: a more current solution using Python 3 and concurrent.futures.ThreadPoolExecutor.
import urllib3
from concurrent.futures import ThreadPoolExecutor

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

def download(url, cmanager):
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])

connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)
Some remarks
My code is based on a similar example from the Python Cookbook by Beazley and Jones.
I particularly like the fact that you only need a standard module besides urllib3.
The setup is extremely simple, and if you are only going for side-effects in download (like printing, saving to a file, etc.), there is no additional effort in joining the threads.
If you want something different, ThreadPoolExecutor.submit actually returns whatever download would return, wrapped in a Future (see the sketch after these remarks).
I found it helpful to align the number of threads in the thread pool with the number of HTTPConnection objects in the connection pool (via maxsize). Otherwise you may encounter (harmless) warnings when all threads try to access the same server (as in the example).
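If you do want the downloaded data back rather than just the printed side effects, here is a sketch of the same setup where download returns a value and the Futures are collected; as_completed comes from the standard library, and the shortened urls list is only illustrative:

import urllib3
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters']

connection_mgr = urllib3.PoolManager(maxsize=5)

def download(url, cmanager):
    response = cmanager.request('GET', url)
    return url, response.status  # return a value instead of printing

with ThreadPoolExecutor(max_workers=5) as thread_pool:
    futures = [thread_pool.submit(download, url, connection_mgr) for url in urls]
    for f in as_completed(futures):
        print(f.result())  # each result is a (url, status) tuple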
Obviously it will create one connection per thread; how else would each thread be able to fetch a page? And you try to use the same connection, made from one URL, for all URLs. That can hardly be what you intended.
This code worked just fine:
import threadpool
from urllib3 import connection_from_url

def fetch(url):
    kwargs = {'retries': 6}
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
    print url, conn.get_url(url)
    print "Done!"

pool = threadpool.ThreadPool(4)

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()
Thread programming is hard, so I wrote workerpool to make exactly what you're doing easier.
More specifically, see the Mass Downloader example.
To do the same thing with urllib3, it looks something like this:
import urllib3
import workerpool

http_pool = urllib3.connection_from_url("foo", maxsize=3)

def download(url):
    r = http_pool.get_url(url)
    # TODO: Do something with r.data
    print "Downloaded %s" % url

# Initialize a pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)

# The ``download`` method will be called with a line from the second
# parameter for each job.
pool.map(download, open("urls.txt").readlines())

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool be the toolbox you pass in.
I use something like this:
# excluding setup for threadpool etc.
upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)

urls = ['/wiki/2010-11_Premier_League',
        '/wiki/List_of_MythBusters_episodes',
        '/wiki/List_of_Top_Gear_episodes',
        '/wiki/List_of_Unicode_characters',
        ]

def fetch(path):
    # add error checking
    return upool.get_url(path).data

tpool = ThreadPool()
tpool.map_async(fetch, urls)
# either wait on the result object or give map_async a callback function for the results
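One caveat, judging by the AttributeError in the question's edit: map_async lives on multiprocessing.pool.ThreadPool, not on the threadpool package's ThreadPool. A sketch with that import, using urllib3's request() method since current urllib3 no longer has get_url():

from multiprocessing.pool import ThreadPool
import urllib3

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)

urls = ['/wiki/2010-11_Premier_League',
        '/wiki/List_of_Unicode_characters',
        ]

def fetch(path):
    # add error checking
    return upool.request('GET', path).data

tpool = ThreadPool(4)
result = tpool.map_async(fetch, urls)
pages = result.get()  # blocks until every fetch has finished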
