Edit 2
Second approach. For now, I gave up on using multiple instances and configured the scrapy settings not to use concurrent requests. It's slow but stable. I opened a bounty. Who can help make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.
class WebkitDownloader(object):

    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b#" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False, timeout=0.5, delay=0.5,
                                                 forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])
            #print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response
Edit:
I tried to answer my own question in the meantime and implemented a queue, but for some reason it does not run asynchronously. Basically, while webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method has finished. New requests are not assigned to the remaining free instances in self.queue.
Can anyone please point me in the right direction to make this work?
class WebkitDownloader(object):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=5.5,
                                                   forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt']))

    def process_response(self, request, response, spider):
        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]

        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response

        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)

        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            #response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            #response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"

        return response
I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process 1 request at a time (no concurrency).
I'd like to use concurrency (e.g. 8 requests at a time), but then I need to make sure that the 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() has finished and is ready to receive the next request).
How would I achieve that with Python? This is my current middleware:
class WebkitDownloader(object):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5,
                                      forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])

    def process_response(self, request, response, spider):
        if ".pdf" not in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
        return response
I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked when webkitBrowser.get is busy?
I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where python threading should work reasonably well. Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.
Here's a recent SO Q/A with example code that might help.
Here's an idea of how to get you started. Create a Queue. Define a function which takes this queue as an argument, gets the web page and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.
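A minimal sketch of that idea (not your middleware), assuming urllib2 for the fetch and placeholder URLs; each get thread puts its result on a shared queue and the main loop drains it:

import time
import urllib2
from Queue import Queue, Empty
from threading import Thread

results = Queue()

def get_page(url, out_queue):
    # fetch the page and hand the body back through the queue
    body = urllib2.urlopen(url).read()
    out_queue.put((url, body))

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
threads = [Thread(target=get_page, args=(u, results)) for u in urls]
for t in threads:
    t.start()

processed = 0
while processed < len(urls):
    try:
        url, body = results.get_nowait()
    except Empty:
        time.sleep(.1)          # nothing ready yet, check again shortly
        continue
    print "got %d bytes from %s" % (len(body), url)
    processed += 1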
I am aware this is an old question, but I had a similar question and hope the information I stumbled upon helps others with the same problem:
1. If scrapyjs + splash works for you (given you are using a WebKit browser, it likely will, as Splash is WebKit-based), it is probably the easiest solution.
2. If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd, or do multiprocessing with scrapy.
3. Depending on whether your browser render is primarily waiting (for pages to render), IO-intensive or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading or multiprocessing (a short sketch of the twisted primitives follows this list). For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.
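A minimal sketch of the Twisted primitives mentioned in point 3: a non-blocking sleep and offloading a blocking render call to Twisted's thread pool. render_blocking is a stand-in for something like WebkitBrowser.get(), not part of the original code:

from twisted.internet import reactor, task, threads

def render_blocking(body):
    # stand-in for a blocking render step such as WebkitBrowser.get()
    return body.upper()

def show(result):
    print "got:", result

# non-blocking sleep: schedules show() after 2 seconds without blocking the reactor
task.deferLater(reactor, 2, show, "delayed result")

# run the blocking call in the reactor's thread pool so other work keeps running
d = threads.deferToThread(render_blocking, "<html>page</html>")
d.addCallback(show)

reactor.callLater(5, reactor.stop)
reactor.run()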
Related
I'm trying to take a list of items and check for their status change based on certain processing by the API. The list will be manually populated and can vary in number up to several thousand.
I'm trying to write a script that makes multiple simultaneous connections to the API to keep checking for the status change. For each item, once the status changes, the attempts to check must stop. Based on reading other posts on Stack Overflow (specifically, What is the fastest way to send 100,000 HTTP requests in Python?), I've come up with the following code. But the script always stops after processing the list once. What am I doing wrong?
One additional issue I'm facing is that the keyboard interrupt never fires (I'm trying Ctrl+C, but it does not kill the script).
from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

requestURLBase = "https://example.com/api"
apiKey = "123456"

concurrent = 200
keepTrying = 1

def doWork():
    while keepTrying == 1:
        url = q.get()
        status, body, url = checkStatus(url)
        checkResult(status, body, url)
        q.task_done()

def checkStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(requestURLBase)
        conn.request("GET", url.path)
        res = conn.getresponse()
        respBody = res.read()
        conn.close()
        return res.status, respBody, ourl  # Status can be 210 for error or 300 for successful API response
    except:
        print "ErrorBlock"
        print res.read()
        conn.close()
        return "error", "error", ourl

def checkResult(status, body, url):
    if "unavailable" not in body:
        print status, body, url
        keepTrying = 1
    else:
        keepTrying = 0

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for value in open('valuelist.txt'):
        fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
        print fullUrl
        q.put(fullUrl)
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
I'm new to Python, so there could be syntax errors as well... I'm definitely not familiar with multi-threading, so perhaps I'm doing something else wrong too.
In your code, the list is only read once. It should be something like:
try:
    while True:
        for value in open('valuelist.txt'):
            fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
            print fullUrl
            q.put(fullUrl)
        q.join()
except KeyboardInterrupt:
    sys.exit(1)
For the interrupt problem, remove the bare except line in checkStatus or make it except Exception. Bare excepts catch all exceptions, including SystemExit (which is what sys.exit raises) and KeyboardInterrupt, and so they can keep the Python process from terminating.
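A sketch of that fix, reusing the question's requestURLBase and helper names; with the narrowed except, Ctrl+C and sys.exit() are no longer swallowed:

from urlparse import urlparse
import httplib

def checkStatus(ourl):
    conn = None
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(requestURLBase)
        conn.request("GET", url.path)
        res = conn.getresponse()
        respBody = res.read()
        return res.status, respBody, ourl
    except Exception as e:   # ordinary errors only; KeyboardInterrupt/SystemExit pass through
        print "ErrorBlock:", e
        return "error", "error", ourl
    finally:
        if conn is not None:
            conn.close()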
If I may make a couple comments in general though.
Threading does not scale well to such a large number of concurrent workers
Creating a new connection for every request is not efficient
What I would suggest is
Use gevent for asynchronous network I/O
Pre-allocate a queue of connections of the same size as the concurrency number, and have checkStatus grab a connection object when it needs to make a call. That way the connections stay alive and get reused, and there is no overhead in creating and destroying them, nor the increased memory use that goes with it (a rough sketch follows below).
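A rough sketch of both suggestions: gevent greenlets for the network I/O and a pre-allocated pool of connections that workers borrow and return. The host name, paths and concurrency figure are placeholders, not the original API:

import gevent.monkey
gevent.monkey.patch_all()

import gevent
import httplib
from gevent.queue import Queue

HOST = "example.com"        # assumed host portion of requestURLBase
CONCURRENCY = 50

# pre-allocate one keep-alive connection per worker (a real pool would also
# reconnect if the server drops a connection)
connections = Queue()
for _ in range(CONCURRENCY):
    connections.put(httplib.HTTPConnection(HOST))

tasks = Queue()
for i in range(200):
    tasks.put("/api?key=123456&value=%d" % i)

def check(path):
    conn = connections.get()          # borrow a connection
    try:
        conn.request("GET", path)
        res = conn.getresponse()
        return res.status, res.read()
    finally:
        connections.put(conn)         # return it so it gets reused

def worker():
    while not tasks.empty():
        path = tasks.get()
        status, body = check(path)
        print status, len(body)

gevent.joinall([gevent.spawn(worker) for _ in range(CONCURRENCY)])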
I have an api code snippet:
@app.route("/do_something", method=['POST', 'OPTIONS'])
# CORS is enabled
def initiate_trade():
    '''
    post json
    some Args: *input
    '''
    if request.method == 'OPTIONS':
        yield {}
    else:
        response.headers['Content-type'] = 'application/json'
        data = (request.json)
        print data
        for dump in json.dumps(function(input)):
            yield dump
The corresponding function is:
def function(*input):
    #========= All about processing foo input ==========#
    ....
    #========= All about processing foo input ends ==========#
    worker = []
    for this in foo_data:
        #doing something
        for _ in xrange(this):
            #doing smthng again
            worker.append(gevent.spawn(foo_fi, args))
        result = gevent.joinall(worker)
        some_dict.update({this: [t.value for t in worker]})
        gevent.killall(worker)
        worker = []
        yield {this: some_dict[this]}
        #gevent.sleep(2)
When I run the DHC REST client without the gevent.sleep(2), it returns everything as if it were a synchronous return value. BUT with gevent.sleep(2) uncommented, nothing comes back at all.
What's wrong?
I thought sleep would cause a delay and the "dump" values would be streamed one by one as they become available.
Also, I'm no JavaScript guy, but I can read the code somewhat. Even AJAX wouldn't receive the data if the server isn't returning it, so I am assuming that rules out a client-side malfunction and that the problem is in this code snippet.
Please note that instead of yielding, if I just return the value as
def function(*input):
    .
    .
    return some_dict
and on api side I do:
return json.dumps(function(input))
then everything works fine on the client side.
I have code that retrieves news results from this newspaper using a query and a time frame (which can be up to a year).
The results are paginated at up to 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL and date of each article. Each cycle (the HTTP request and the parsing) takes from 30 seconds to a minute, which is extremely slow. And eventually it will stop with a response code of 500. I am wondering if there are ways to speed it up, or maybe make multiple requests at once. I simply want to retrieve the article details from all the pages.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})

        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))

        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
This is a good opportunity to try out gevent.
You should have a separate routine for the requests.get part so that your application isn't blocked waiting on I/O.
You can then spawn multiple workers and have queues to pass requests and articles around.
Maybe something similar to this:
import gevent.monkey
from gevent.queue import Queue
from gevent import sleep

gevent.monkey.patch_all()

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)
articles = Queue()

mock_responses = range(100)
mock_responses.reverse()

def request():
    print "worker started"
    while True:
        print "request %s" % requests.get()
        sleep(1)
        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            articles.put(StopIteration)
            break

def run():
    print "run"
    i = 1
    while True:
        requests.put(i)
        i += 1

if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)
    gevent.spawn(run)
    for article in articles:
        print "Got article: %s" % article
The most probable slowdown is the server, so parallelising the HTTP requests is the best way to make the code run faster, although there is very little you can do to speed up the server response. There's a good tutorial over at IBM for doing exactly this.
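For illustration only (not taken from the IBM tutorial), a rough sketch of parallelising the page fetches with a thread pool; it reuses the question's URL template, and the page count is an assumption:

from multiprocessing.dummy import Pool  # thread-backed Pool, fine for I/O-bound work
import requests

def fetch(index):
    # query/date values copied from the question's run() call
    return requests.get(URL.format(index=index, query="Egypt",
                                   datefrom="12-01-2010", dateto="12-01-2011")).content

pool = Pool(8)                            # 8 downloads in flight at a time
pages = pool.map(fetch, range(1, 21))     # assumes the first 20 result pages
pool.close()
pool.join()
# each entry of `pages` can then be parsed with BeautifulSoup as in run()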
It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, it's a problem that has been solved before -- there are many sites that will generate feeds for you for an arbitrary website thus at least solving one of your problems. Some of these require some human guidance, and others have less opportunity for tweaking and are more automatic.
If you can at all avoid doing the pagination and parsing yourself, I'd recommend it. If you cannot, I second the use of gevent for simplicity. That said, if they're sending you back 500's, your code is likely less of an issue and added parallelism may not help.
You can try making all the calls asynchronously.
Have a look at this:
http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html
You could use gevent as well rather than twisted, but I'm just telling you the options.
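For reference, a minimal sketch of the Twisted approach the linked post covers, using the old getPage helper; the URLs are placeholders:

from twisted.internet import reactor
from twisted.web.client import getPage

urls = ["http://example.com/page1", "http://example.com/page2"]
pending = [len(urls)]

def on_body(body, url):
    print url, len(body)
    pending[0] -= 1
    if pending[0] == 0:
        reactor.stop()

def on_error(failure, url):
    print url, failure.getErrorMessage()
    pending[0] -= 1
    if pending[0] == 0:
        reactor.stop()

for url in urls:
    d = getPage(url)
    d.addCallback(on_body, url)
    d.addErrback(on_error, url)

reactor.run()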
This might very well come close to what you're looking for.
Ideal method for sending multiple HTTP requests over Python? [duplicate]
Source code:
https://github.com/kennethreitz/grequests
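For context, a short sketch of what the linked grequests library looks like in use (the URLs are placeholders):

import grequests

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# build the request objects lazily, then send them all concurrently
unsent = (grequests.get(u) for u in urls)
responses = grequests.map(unsent, size=10)   # size caps the concurrency

for r in responses:
    if r is not None:            # failed requests come back as None
        print r.status_code, len(r.content)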
I've previously written applications, specifically data scrapers, in Node.js. These types of applications had no web front end, but were merely processes timed with cron jobs to asynchronously make a number of possibly complicated HTTP GET requests to pull web pages, and then scrape and store the data from the results.
A sample of a function I might write would be this:
// Node.js
var request = require("request");

function scrapeEverything() {
    var listOfIds = [23423, 52356, 63462, 34673, 67436];
    for (var i = 0; i < listOfIds.length; i++) {
        request({uri: "http://mydatasite.com/?data_id = " + listOfIds[i]},
            function(err, response, body) {
                var jsonobj = JSON.parse(body);
                storeMyData(jsonobj);
            });
    }
}
This function loops through the IDs and makes a bunch of asynchronous GET requests, from which it then stores the data.
I'm now writing a scraper in Python and attempting to do the same thing using Tornado, but everything I see in the documentation refers to Tornado acting as a web server, which is not what I'm looking for. Anyone know how to do this?
Slightly more involved answer than I thought I would throw together, but it's a quick demo of how to use Tornado ioloop and AsyncHTTPClient to fetch some data. I've actually written a webcrawler in Tornado, so it can be used "headless".
import tornado.ioloop
import tornado.httpclient

class Fetcher(object):
    def __init__(self, ioloop):
        self.ioloop = ioloop
        self.client = tornado.httpclient.AsyncHTTPClient(io_loop=ioloop)

    def fetch(self, url):
        self.client.fetch(url, self.handle_response)

    @property
    def active(self):
        """True if there are active fetches happening"""
        return len(self.client.active) != 0

    def handle_response(self, response):
        if response.error:
            print "Error:", response.error
        else:
            print "Got %d bytes" % (len(response.body))
        if not self.active:
            self.ioloop.stop()

def main():
    ioloop = tornado.ioloop.IOLoop.instance()
    ioloop.add_callback(scrapeEverything)
    ioloop.start()

def scrapeEverything():
    fetcher = Fetcher(tornado.ioloop.IOLoop.instance())
    listOfIds = [23423, 52356, 63462, 34673, 67436]
    for id in listOfIds:
        fetcher.fetch("http://mydatasite.com/?data_id=%d" % id)

if __name__ == '__main__':
    main()
If you are open to alternatives to Tornado (I assume you scrape using socket programming instead of urllib2), you may be interested in asyncoro, a framework for asynchronous, concurrent (and distributed, fault-tolerant) programming. Programming with asyncoro is very similar to programming with threads, except for a few syntactic changes. Your problem could be implemented with asyncoro as:
import asyncoro, socket

def process(url, coro=None):
    # create asynchronous socket
    sock = asyncoro.AsynCoroSocket(socket.socket())
    # parse url to get host, port; prepare get_request
    yield sock.connect((host, port))
    yield sock.send(get_request)
    body = yield sock.recv()
    # ...
    # process body

for i in [23423, 52356, 63462, 34673, 67436]:
    asyncoro.Coro(process, "http://mydatasite.com/?data_id = %s" % i)
You can also try a native solution that does not require any external library. On Linux it is based on epoll and may look like this. Usage example:
# ------------------------------------------------------------------------------------
def sampleCallback(status, data, request):
    print 'fetched:', status, len(data)
    print data

# ------------------------------------------------------------------------------------
fetch(HttpRequest('google.com:80', 'GET', '/', None, sampleCallback))
I'm writing a script in Python that should determine if it has internet access.
import urllib

CHECK_PAGE = "http://64.37.51.146/check.txt"
CHECK_VALUE = "true\n"
PROXY_VALUE = "Privoxy"
OFFLINE_VALUE = ""

page = urllib.urlopen(CHECK_PAGE)
response = page.read()
page.close()

if response.find(PROXY_VALUE) != -1:
    urllib.getproxies = lambda x = None: {}
    page = urllib.urlopen(CHECK_PAGE)
    response = page.read()
    page.close()

if response != CHECK_VALUE:
    print "'" + response + "' != '" + CHECK_VALUE + "'" #
else:
    print "You are online!"
I use a proxy on my computer, so correct proxy handling is important. If it can't connect to the internet through the proxy, it should bypass the proxy and see if it's stuck at a login page (as at many public hotspots I use). With that code, if I am not connected to the internet, the first read() returns the proxy's error page. But when I bypass the proxy after that, I get the same page. If I bypass the proxy BEFORE making any requests, I get an error, as I should. I think Python is caching the page from the first time around.
How do I force Python to clear its cache (or is this some other problem)?
Calling urllib.urlcleanup() before each call to urllib.urlopen() will solve the problem. urllib.urlopen() goes through the urlretrieve() machinery, which creates a cache to hold data, and urlcleanup() removes it.
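A small sketch of that suggestion (Python 2 urllib, using the check URL from the question):

import urllib

CHECK_PAGE = "http://64.37.51.146/check.txt"

def fetch_fresh(url, proxies=None):
    urllib.urlcleanup()                  # drop any cached temporary files first
    page = urllib.urlopen(url, proxies=proxies)
    try:
        return page.read()
    finally:
        page.close()

via_proxy = fetch_fresh(CHECK_PAGE)           # normal request, proxy settings apply
direct = fetch_fresh(CHECK_PAGE, proxies={})  # explicit proxy bypass, no stale page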
You want
page = urllib.urlopen(CHECK_PAGE, proxies={})
Remove the
urllib.getproxies = lambda x = None: {}
line.