I've previously written applications, specifically data scrapers, in Node.js. These applications had no web front end; they were merely processes, timed with cron jobs, that asynchronously made a number of possibly complicated HTTP GET requests to pull web pages, then scraped and stored the data from the results.
A sample of a function I might write would be this:
// Node.js
var request = require("request");

function scrapeEverything() {
    var listOfIds = [23423, 52356, 63462, 34673, 67436];
    for (var i = 0; i < listOfIds.length; i++) {
        request({uri: "http://mydatasite.com/?data_id=" + listOfIds[i]},
            function (err, response, body) {
                var jsonobj = JSON.parse(body);
                storeMyData(jsonobj);
            });
    }
}
This function loops through the IDs, makes a bunch of asynchronous GET requests, and stores the data from each response.
I'm now writing a scraper in Python and attempting to do the same thing using Tornado, but everything I see in the documentation refers to Tornado acting as a web server, which is not what I'm looking for. Anyone know how to do this?
This is a slightly more involved answer than I thought I would throw together, but it's a quick demo of how to use Tornado's IOLoop and AsyncHTTPClient to fetch some data. I've actually written a web crawler in Tornado, so it can definitely be used "headless".
import tornado.ioloop
import tornado.httpclient

class Fetcher(object):
    def __init__(self, ioloop):
        self.ioloop = ioloop
        self.client = tornado.httpclient.AsyncHTTPClient(io_loop=ioloop)

    def fetch(self, url):
        self.client.fetch(url, self.handle_response)

    @property
    def active(self):
        """True if there are active fetches happening"""
        return len(self.client.active) != 0

    def handle_response(self, response):
        if response.error:
            print "Error:", response.error
        else:
            print "Got %d bytes" % (len(response.body))
        if not self.active:
            self.ioloop.stop()

def main():
    ioloop = tornado.ioloop.IOLoop.instance()
    ioloop.add_callback(scrapeEverything)
    ioloop.start()

def scrapeEverything():
    fetcher = Fetcher(tornado.ioloop.IOLoop.instance())
    listOfIds = [23423, 52356, 63462, 34673, 67436]
    for id in listOfIds:
        fetcher.fetch("http://mydatasite.com/?data_id=%d" % id)

if __name__ == '__main__':
    main()
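For what it's worth, newer Tornado versions (5+, on Python 3) let you write the same headless pattern with coroutines instead of explicit callbacks. A minimal sketch, assuming the same hypothetical URL and skipping error handling:

import tornado.gen
import tornado.httpclient
import tornado.ioloop

async def scrape_everything():
    client = tornado.httpclient.AsyncHTTPClient()
    listOfIds = [23423, 52356, 63462, 34673, 67436]
    # fire off all requests concurrently and wait for the whole batch
    responses = await tornado.gen.multi(
        [client.fetch("http://mydatasite.com/?data_id=%d" % i) for i in listOfIds])
    for resp in responses:
        print("Got %d bytes" % len(resp.body))

if __name__ == "__main__":
    tornado.ioloop.IOLoop.current().run_sync(scrape_everything)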
If you are open to alternatives to Tornado (and assuming you scrape using socket programming instead of urllib2), you may be interested in asyncoro, a framework for asynchronous, concurrent (and distributed, fault-tolerant) programming. Programming with asyncoro is very similar to programming with threads, except for a few syntactic changes. Your problem could be implemented with asyncoro as:
import asyncoro, socket

def process(url, coro=None):
    # create asynchronous socket
    sock = asyncoro.AsynCoroSocket(socket.socket())
    # parse url to get host, port; prepare get_request
    yield sock.connect((host, port))
    yield sock.send(get_request)
    body = yield sock.recv()
    # ...
    # process body

for i in [23423, 52356, 63462, 34673, 67436]:
    asyncoro.Coro(process, "http://mydatasite.com/?data_id=%s" % i)
You can also try a native solution that does not require any external library. For Linux it is based on epoll and may look like this. Usage example:
# ------------------------------------------------------------------------------------
def sampleCallback(status, data, request):
    print 'fetched:', status, len(data)
    print data

# ------------------------------------------------------------------------------------
fetch(HttpRequest('google.com:80', 'GET', '/', None, sampleCallback))
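Since the code behind that link is easy to miss, here is a rough standalone sketch of the same idea (my own illustration, not the referenced implementation): non-blocking sockets registered with select.epoll, each driven from connect to response. It assumes plain HTTP on port 80, does no header or status parsing, and omits error handling.

import select
import socket

def fetch_all(targets):
    """targets: list of (host, path, callback); callback receives the raw response bytes."""
    ep = select.epoll()
    conns = {}
    for host, path, callback in targets:
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex((host, 80))                  # non-blocking connect
        ep.register(s.fileno(), select.EPOLLOUT)  # wait until writable (connected)
        conns[s.fileno()] = (s, host, path, callback, [])
    while conns:
        for fd, event in ep.poll(1):
            s, host, path, callback, chunks = conns[fd]
            if event & select.EPOLLOUT:
                # connection established: send the request, then wait for data
                req = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
                s.send(req.encode())
                ep.modify(fd, select.EPOLLIN)
            elif event & (select.EPOLLIN | select.EPOLLHUP):
                data = s.recv(4096)
                if data:
                    chunks.append(data)
                else:                              # peer closed: response is complete
                    ep.unregister(fd)
                    s.close()
                    del conns[fd]
                    callback(b"".join(chunks))     # headers + body, unparsed
    ep.close()

def show(raw):
    print("fetched %d bytes" % len(raw))

fetch_all([("google.com", "/", show)])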
I'm following this Route_Guide sample.
The sample in question fires off and reads messages without ever replying to a specific message; replying to a specific message is what I'm trying to achieve.
Here's what I have so far:
import grpc
...

channel = grpc.insecure_channel(conn_str)
try:
    grpc.channel_ready_future(channel).result(timeout=5)
except grpc.FutureTimeoutError:
    sys.exit('Error connecting to server')
else:
    stub = MyService_pb2_grpc.MyServiceStub(channel)
    print('Connected to gRPC server.')
    this_is_just_read_maybe(stub)
def this_is_just_read_maybe(stub):
    responses = stub.MyEventStream(stream())
    for response in responses:
        print(f'Received message: {response}')
        if response.something:
            # okay, now what? how do i send a message here?

def stream():
    yield my_start_stream_msg
    # this is fine, i receive this server-side
    # but i can't check for incoming messages here
I don't seem to have a read() or write() method on the stub; everything seems to be implemented with iterators.
How do I send a message from this_is_just_read_maybe(stub)?
Is that even the right approach?
My Proto is a bidirectional stream:
service MyService {
    rpc MyEventStream (stream StreamingMessage) returns (stream StreamingMessage) {}
}
What you're trying to do is perfectly possible and will probably involve writing your own request iterator object that can be given responses as they arrive rather than using a simple generator as your request iterator. Perhaps something like
class MySmarterRequestIterator(object):

    def __init__(self):
        self._lock = threading.Lock()
        self._responses_so_far = []

    def __iter__(self):
        return self

    def _next(self):
        # some logic that depends upon what responses have been seen
        # before returning the next request message
        return <your message value>

    def __next__(self):  # Python 3
        return self._next()

    def next(self):  # Python 2
        return self._next()

    def add_response(self, response):
        with self._lock:
            self._responses_so_far.append(response)
that you then use like
my_smarter_request_iterator = MySmarterRequestIterator()
responses = stub.MyEventStream(my_smarter_request_iterator)
for response in responses:
    my_smarter_request_iterator.add_response(response)
. There will probably be locking and blocking in your _next implementation to handle the situation of gRPC Python asking your object for the next request that it wants to send and your responding (in effect) "wait, hold on, I don't know what request I want to send until after I've seen how the next response turned out".
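To make that concrete, here is one hedged sketch of what the blocking could look like, using a Condition to park _next until add_response has decided on the next outgoing message (make_reply_for stands in for whatever your own logic is):

import threading

class MySmarterRequestIterator(object):

    def __init__(self, first_request):
        self._lock = threading.Lock()
        self._ready = threading.Condition(self._lock)
        self._pending = [first_request]   # requests waiting to be sent

    def __iter__(self):
        return self

    def add_response(self, response):
        # decide, based on the response just seen, what to send next
        with self._lock:
            self._pending.append(make_reply_for(response))  # make_reply_for is a placeholder
            self._ready.notify()

    def _next(self):
        # gRPC asks for the next request message here; block until one is ready
        with self._lock:
            while not self._pending:
                self._ready.wait()
            return self._pending.pop(0)

    def __next__(self):  # Python 3
        return self._next()

    next = __next__  # Python 2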
Instead of writing a custom iterator, you can also use a blocking queue to implement send- and receive-like behaviour for the client stub:
import queue
...

send_queue = queue.SimpleQueue()  # or Queue if using Python before 3.7
my_event_stream = stub.MyEventStream(iter(send_queue.get, None))

# send
send_queue.put(StreamingMessage())

# receive
response = next(my_event_stream)  # type: StreamingMessage
This makes use of the sentinel form of iter, which converts a regular function into an iterator that stops when it reaches a sentinel value (in this case None).
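Closing the request stream from the client side is then just a matter of queuing that sentinel:

send_queue.put(None)  # iter(send_queue.get, None) stops here, which ends the request stream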
I am trying to do stress test on a server using Python 3. The idea is to send an HTTP request to the API server every 1 second for 30 minutes. I tried using requests and apscheduler to do this but I kept getting
Execution of job "send_request (trigger: interval[0:00:01], next run at: 2017-05-23 11:05:46 EDT)"
skipped: maximum number of running instances reached (1)
How can I make this work? Below is my code so far:
import requests, json, time, os, ipdb
from apscheduler.schedulers.blocking import BlockingScheduler as scheduler

def send_request():
    url = 'http://api/url/'
    # Username and password
    credentials = {'username': 'username', 'password': 'password'}
    # Header
    headers = {'Content-Type': 'application/json', 'Client-Id': 'some string'}
    # Defining payloads
    payload = dict()
    payload['item1'] = 1234
    payload['item2'] = 'some string'
    data_array = [{"id": "id1", "data": "some value"}]
    payload['json_data_array'] = [{"time": int(time.time()), "data": data_array}]
    # Posting data
    try:
        request = requests.post(url, headers=headers, data=json.dumps(payload))
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as err:
        print("Error while trying to POST pid data")
        print(err)
    finally:
        request.close()
    print(request.content)
    return request.content

if __name__ == '__main__':
    sched = scheduler()
    print(time.time())
    sched.add_job(send_request, 'interval', seconds=1)
    sched.start()
    print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C'))
    try:
        # This is here to simulate application activity (which keeps the main thread alive).
        while True:
            pass
    except (KeyboardInterrupt, SystemExit):
        # Not strictly necessary if daemonic mode is enabled but should be done if possible
        sched.shutdown()
I tried searching on Stack Overflow, but none of the other questions do what I want so far, or maybe I missed something. I would appreciate it if someone could point me to the correct thread if that is the case. Thank you very much!
I think your error is described well by the duplicate that I marked, as well as by the answer from @jeff.
Edit: Apparently not... so here I'll describe how to fix the maximum instances problem:
Maximum instances problem
When you're adding jobs to the scheduler, there is an argument you can set for the number of maximum allowed concurrent instances of the job. You can read about this here:
BaseScheduler.add_job()
So, fixing your problem is just a matter of setting this to something higher:
sch.add_job(myfn, 'interval', seconds=1, max_instances=10)
But, how many concurrent requests do you want? If they take more than one second to respond, and you request one per second, you will always eventually get an error if you let it run long enough...
Schedulers
There are several scheduler options available, here are two:
BackgroundScheduler
You're importing the blocking scheduler - which blocks when started. So, the rest of your code is not being executed until after the scheduler stops. If you need other code to be executed after starting the scheduler, I would use the background scheduler like this:
import time
from apscheduler.schedulers.background import BackgroundScheduler as scheduler

def myfn():
    # Insert your requests code here
    print('Hello')

sch = scheduler()
sch.add_job(myfn, 'interval', seconds=5)
sch.start()

# This code will be executed after the scheduler has started
try:
    print('Scheduler started, ctrl-c to exit!')
    while 1:
        # Notice that using "pass" here creates an unthrottled loop:
        # try swapping the sleep below for "pass" (or "input()") and
        # watch your cpu usage.
        time.sleep(.1)
except KeyboardInterrupt:
    if sch.state:
        sch.shutdown()
BlockingScheduler
If you don't need other code to be executed after starting the scheduler, you can use the blocking scheduler and it's even easier:
from apscheduler.schedulers.blocking import BlockingScheduler as scheduler

def myfn():
    # Insert your requests code here
    print('Hello')

# Execute your code before starting the scheduler
print('Starting scheduler, ctrl-c to exit!')
sch = scheduler()
sch.add_job(myfn, 'interval', seconds=5)
sch.start()
I have never used the scheduler in Python before; however, this other Stack Overflow question seems to deal with that.
It means that the task is taking longer than one second and by default only one concurrent execution is allowed for a given job... -Alex Grönholm
In your case I imagine using threading would meet your needs.
If you created a class that inherits from Thread in Python, something like:
import threading, requests, json, time

class Requester(threading.Thread):
    def __init__(self, url, credentials, payload):
        threading.Thread.__init__(self)
        self.url = url
        self.credentials = credentials
        self.payload = payload

    def run(self):
        # do the post request here (mirroring the POST from the question);
        # you may want to write output (errors and content) to a file
        # rather than just printing it out; sometimes when using threads
        # it gets really messy if you just print everything out
        try:
            response = requests.post(self.url, data=json.dumps(self.payload))
            print(response.content)
        except requests.RequestException as err:
            print(err)
Then use it much like you already do, with a slight change:
if __name__ == '__main__':
    url = 'http://api/url/'
    # Username and password
    credentials = {'username': 'username', 'password': 'password'}
    # Defining payloads
    payload = dict()
    payload['item1'] = 1234
    payload['item2'] = 'some string'
    data_array = [{"id": "id1", "data": "some value"}]
    payload['json_data_array'] = [{"time": int(time.time()), "data": data_array}]

    counter = 0
    while counter < 1800:
        req = Requester(url, credentials, payload)
        req.start()
        counter += 1
        time.sleep(1)
And of course finish the rest of it however you would like to; if you want, you could make it so that a KeyboardInterrupt is what actually finishes the script.
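For instance (a sketch of that tweak, reusing the loop from above):

try:
    counter = 0
    while counter < 1800:
        req = Requester(url, credentials, payload)
        req.start()
        counter += 1
        time.sleep(1)
except KeyboardInterrupt:
    print('Interrupted, no more requests will be started.')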
This of course is a way to get around the scheduler, if that is what the issue is.
I'm trying to take a list of items and check for their status change based on certain processing by the API. The list will be manually populated and can vary in number to several thousand.
I'm trying to write a script that makes multiple simultaneous connections to the API to keep checking for the status change. For each item, once the status changes, the attempts to check must stop. Based on reading other posts on Stack Overflow (specifically, What is the fastest way to send 100,000 HTTP requests in Python?), I've come up with the following code. But the script always stops after processing the list once. What am I doing wrong?
One additional issue I'm facing is that the keyboard interrupt never fires (I'm trying with Ctrl+C, but it does not kill the script).
from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

requestURLBase = "https://example.com/api"
apiKey = "123456"

concurrent = 200
keepTrying = 1

def doWork():
    while keepTrying == 1:
        url = q.get()
        status, body, url = checkStatus(url)
        checkResult(status, body, url)
        q.task_done()

def checkStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(requestURLBase)
        conn.request("GET", url.path)
        res = conn.getresponse()
        respBody = res.read()
        conn.close()
        return res.status, respBody, ourl  # Status can be 210 for error or 300 for successful API response
    except:
        print "ErrorBlock"
        print res.read()
        conn.close()
        return "error", "error", ourl

def checkResult(status, body, url):
    if "unavailable" not in body:
        print status, body, url
        keepTrying = 1
    else:
        keepTrying = 0

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for value in open('valuelist.txt'):
        fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
        print fullUrl
        q.put(fullUrl)
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
I'm new to Python so there could be syntax errors as well... I'm definitely not familiar with multi-threading so perhaps I'm doing something else wrong as well.
In your code, the list is only read once. It should be something like:
try:
    while True:
        for value in open('valuelist.txt'):
            fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
            print fullUrl
            q.put(fullUrl)
        q.join()
except KeyboardInterrupt:
    sys.exit(1)
For the interrupt thing, remove the bare except line in checkStatus or make it except Exception. Bare excepts catch all exceptions, including SystemExit (which is what sys.exit raises), and will keep the Python process from terminating.
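Concretely, keeping the rest of checkStatus exactly as in your question, the handler could become:

def checkStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(requestURLBase)
        conn.request("GET", url.path)
        res = conn.getresponse()
        respBody = res.read()
        conn.close()
        return res.status, respBody, ourl
    except Exception as err:  # no longer swallows SystemExit / KeyboardInterrupt
        print "ErrorBlock:", err
        return "error", "error", ourl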
If I may make a couple of general comments, though:
Threading is not a good implementation for such large concurrencies
Creating a new connection every time is not efficient
What I would suggest is
Use gevent for asynchronous network I/O
Pre-allocate a queue of connections of the same size as the concurrency number, and have checkStatus grab a connection object whenever it needs to make a call (a sketch follows this list). That way the connections stay alive and get reused, with no overhead from creating and destroying them and none of the increased memory use that goes with it.
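A rough sketch of those two suggestions combined (my own illustration rather than tested code, keeping the question's Python 2 httplib; the hostname and path are placeholders):

import gevent
from gevent import monkey
monkey.patch_all()  # make socket/httplib cooperative under gevent

import httplib
from gevent.queue import Queue

CONCURRENT = 200
connections = Queue()
for _ in range(CONCURRENT):
    connections.put(httplib.HTTPConnection("example.com"))  # created once, reused

def checkStatus(path):
    conn = connections.get()            # borrow a live connection from the pool
    try:
        conn.request("GET", path)
        res = conn.getresponse()
        return res.status, res.read()
    finally:
        connections.put(conn)           # hand it back for reuse
        # (a dropped keep-alive connection would need reconnect handling here)

jobs = [gevent.spawn(checkStatus, "/api?key=123456&value=%d" % value)
        for value in range(1000)]
gevent.joinall(jobs)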
I have an API code snippet:
@app.route("/do_something", method=['POST', 'OPTIONS'])
# CORS is enabled
def initiate_trade():
    '''
    post json
    some Args: *input
    '''
    if request.method == 'OPTIONS':
        yield {}
    else:
        response.headers['Content-type'] = 'application/json'
        data = (request.json)
        print data
        for dump in json.dumps(function(input)): yield dump
The corresponding function is:
def function(*input):
    #========= All about processing foo input ==========#
    ....
    #========= All about processing foo input ends ==========#
    worker = []
    for this in foo_data:
        #doing something
        for _ in xrange(this):
            #doing smthng again
            worker.append(gevent.spawn(foo_fi, args))
        result = gevent.joinall(worker)
        some_dict.update({this: [t.value for t in worker]})
        gevent.killall(worker)
        worker = []
        yield {this: some_dict[this]}
        #gevent.sleep(2)
When I run the DHC REST client without the gevent.sleep(2), it returns everything at once, as if it were a synchronous return value. But with the gevent.sleep(2) uncommented, nothing comes back.
What's wrong?
I thought the sleep would cause a delay and the "dump" values would be streamed one by one as they became available.
Also, I'm no JavaScript guy, but I can read the code somewhat. Even Ajax wouldn't receive anything if the server is not returning it, so I am assuming that rules out any client-side malfunction and that this has everything to do with this code snippet.
Please note that instead of yielding, if I just return the value as
def function(*input):
    .
    .
    return some_dict
and on the API side I do:
return json.dumps(function(input))
then everything works fine on the client side.
Edit 2
Second approach: for now, I gave up on using multiple instances and configured the scrapy settings not to use concurrent requests. It's slow but stable. I opened a bounty. Who can help make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.
class WebkitDownloader(object):

    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b@" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False, timeout=0.5, delay=0.5, forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])
            #print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response
Edit:
I tried to answer my own question in the meantime and implemented a queue, but it does not run asynchronously for some reason. Basically, when webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method is finished. New requests are not assigned to the remaining free instances in self.queue.
Can anyone please point me in the right direction to make this work?
class WebkitDownloader(object):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=5.5, forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt']))

    def process_response(self, request, response, spider):
        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]

        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response

        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)

        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            #response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            #response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"

        return response
I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process 1 request at a time (no concurrency).
I'd like to use concurrency (e.g. 8 requests at a time) but then I need to make sure that 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() is done and ready to receive the next request)
How would I achieve that with Python? This is my current middleware:
class WebkitDownloader(object):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5, forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])

    def process_response(self, request, response, spider):
        if not ".pdf" in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
        return response
I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked when webkitBrowser.get is busy?
I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where python threading should work reasonably well. Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.
Here's a recent SO Q/A with example code that might help.
Here's an idea to get you started. Create a Queue. Define a function which takes this queue as an argument, gets the web page, and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.
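A rough sketch of that idea (my own illustration, independent of scrapy; fetch_page stands in for the webkitBrowser.get() call and just simulates work):

import threading
import time
from queue import Queue, Empty  # Queue.Queue / Queue.Empty on Python 2

results = Queue()

def fetch_page(url, out_queue):
    # stand-in for webkitBrowser.get(...) / current_html()
    time.sleep(0.5)
    out_queue.put((url, "<html>rendered %s</html>" % url))

urls = ["http://example.com/%d" % i for i in range(8)]
threads = [threading.Thread(target=fetch_page, args=(u, results)) for u in urls]
for t in threads:
    t.start()

pending = len(urls)
while pending:
    try:
        url, html = results.get(timeout=0.1)
    except Empty:
        time.sleep(.1)  # nothing ready yet; avoid a busy loop
        continue
    print("processed %s (%d bytes)" % (url, len(html)))
    pending -= 1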
I am aware this is an old question, but I had a similar question and hope the information I stumbled upon helps others with a similar problem:
1. If scrapyjs + splash works for you (given you are using a webkit browser, it likely does, as splash is webkit-based), it is probably the easiest solution;
2. If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd or do multiprocessing with scrapy (a sketch of the multiple-spiders approach follows this list);
3. Depending on whether your browser rendering is primarily waiting (for pages to render), IO-intensive or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading or multiprocessing. For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.
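For option 2, a hedged sketch of running two spiders in the same process with scrapy's CrawlerProcess (the spider classes and project module are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# MySpiderA / MySpiderB are placeholders for your own spider classes
from myproject.spiders import MySpiderA, MySpiderB

process = CrawlerProcess(get_project_settings())
process.crawl(MySpiderA)
process.crawl(MySpiderB)  # both crawls run concurrently on the same reactor
process.start()           # blocks until every crawl is finished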