The code below is an HTTP proxy for content filtering. It uses a GET request to send the URL of the current site to a server, which processes it and responds. It runs very, very slowly. Any ideas on how to make it faster?
Here is the code:
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
from Tkinter import *
#import win32api
import urllib2
import urllib
import os
import sys
import webbrowser

cwd = os.path.abspath(sys.argv[0])[0]
proxies = {}
user = "zachb"
class BlockingProxyRequest(ProxyRequest):
    def process(self):
        # Ask the filtering server whether this URL is allowed for this user.
        params = {}
        params['Location'] = self.uri
        params['User'] = user
        params = urllib.urlencode(params)
        req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
        resp = req.read()
        req.close()
        if resp == "allow":
            ProxyRequest.process(self)
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()
class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest
factory = http.HTTPFactory()
factory.protocol = BlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
Anyone have any ideas on how to make this run faster? Or even a better way to write it?
The main cause of slowness in this proxy is probably these three lines:
req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
resp = req.read()
req.close()
A normal Twisted-based application is single threaded. You have to go out of your way to get threads involved. That means that whenever a request comes in, you are blocking the one and only processing thread on this HTTP request. No further requests are processed until this HTTP request completes.
Try using one of the APIs in twisted.web.client (e.g. Agent or getPage). These APIs don't block, so your server will handle concurrent requests concurrently. That should translate into much smaller response times.
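For example, the blocking check can be moved onto a Deferred so the reactor stays free while the filtering server is consulted. A minimal, untested sketch using getPage against the same stats.php endpoint (the _checkDone and _checkFailed helper names are just for illustration):

from twisted.web.client import getPage

class BlockingProxyRequest(ProxyRequest):
    def process(self):
        # Kick off the filter check asynchronously; other connections
        # keep getting serviced while we wait for the answer.
        params = urllib.urlencode({'Location': self.uri, 'User': user})
        d = getPage("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params)
        d.addCallbacks(self._checkDone, self._checkFailed)

    def _checkDone(self, resp):
        if resp == "allow":
            ProxyRequest.process(self)
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()

    def _checkFailed(self, failure):
        # If the filtering server is unreachable, fail closed and block.
        self.transport.write('''BLOCKED BY ADMIN!''')
        self.transport.loseConnection()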
I'm setting up a small Python service to act as a REST API reverse proxy, and I'm hoping there are some libraries available to help speed this process up.
I need to be able to run a function that calculates a value to inject as a request header when the request is proxied through to the backend.
As it stands, I have a simpler script that runs the function, injects the result into an Nginx config file, and then forces an Nginx hot reload via signals, but I'm trying to remove this dependency for what should be a fairly simple task.
Would a good approach be to use Falcon as the listener and combine it with another library to inject and forward requests?
Thanks for reading.
Edit: I've been reading https://aiohttp.readthedocs.io/en/stable/ as it seems to be the right direction.
Thanks to someone over at the Falcon project, this is now the accepted answer!
import io

import falcon
import requests


class Proxy(object):
    UPSTREAM = 'https://httpbin.org'

    def __init__(self):
        self.session = requests.Session()

    def handle(self, req, resp):
        # Copy the incoming headers, tag the request, and drop headers
        # that should not be forwarded upstream.
        headers = dict(req.headers, Via='Falcon')
        for name in ('HOST', 'CONNECTION', 'REFERER'):
            headers.pop(name, None)

        request = requests.Request(req.method, self.UPSTREAM + req.path,
                                   data=req.bounded_stream.read(),
                                   headers=headers)
        prepared = request.prepare()
        from_upstream = self.session.send(prepared, stream=True)

        # Stream the upstream response back to the client.
        resp.content_type = from_upstream.headers.get('Content-Type',
                                                      falcon.MEDIA_HTML)
        resp.status = falcon.get_http_status(from_upstream.status_code)
        resp.stream = from_upstream.iter_content(io.DEFAULT_BUFFER_SIZE)


api = falcon.API()
api.add_sink(Proxy().handle)
I have a Tornado server listening on port 6789 for POST requests to "/train" and "/predict". The train method might take up to 3 hours to complete, while predict might return in 2 minutes. I want them to be handled concurrently, so that even while "/train" is running, a POST request to "/predict" can be handled and return its output without waiting for "/train" to complete.
I have tried using ThreadPool, but it still doesn't run concurrently.
My present code is as follows. It functions, but if a request to train is made and then a request to predict is made, it waits for train to complete before handling predict. Assume the train and predict functions exist and take no parameters.
import logging
from multiprocessing.pool import ThreadPool

import flask
from flask import Flask
from tornado import wsgi, httpserver, ioloop

from train_script import train
from predict_script import predict

app = Flask(__name__)


@app.route("/train", methods=['POST'])
def train_run():
    payload = flask.request.get_json(silent=True)
    if payload is not None:
        try:
            # apply_async runs train on a pool thread, but get() blocks
            # this handler until it finishes.
            async_result = pool.apply_async(train)
            response = async_result.get()
            resp = flask.jsonify(response)
            resp.status_code = 200
        except Exception:
            resp = flask.jsonify({"status": "Failure"})
            resp.status_code = 500
    else:
        resp = flask.jsonify({"status": "Failure"})
        resp.status_code = 500
    return resp


@app.route("/predict", methods=['POST'])
def predict_run():
    payload = flask.request.get_json(silent=True)
    if payload is not None:
        try:
            async_result = pool.apply_async(predict)
            response = async_result.get()
            resp = flask.jsonify(response)
            resp.status_code = 200
        except Exception:
            resp = flask.jsonify({"status": "Failure"})
            resp.status_code = 500
    else:
        resp = flask.jsonify({"status": "Failure"})
        resp.status_code = 500
    return resp


if __name__ == "__main__":
    port = 6789
    http_server = httpserver.HTTPServer(wsgi.WSGIContainer(app))
    pool = ThreadPool(processes=10)  # expects at most 10 concurrent requests
    http_server.listen(port)
    logging.info("Tornado server starting on port {}".format(port))
    ioloop.IOLoop.instance().start()
Tornado's WSGIContainer does not support any kind of concurrency. Either use Tornado's RequestHandler interfaces without Flask or WSGI, or use Flask with gunicorn or uwsgi. You gain almost nothing and lose a lot by combining Tornado with WSGI frameworks, so this is only useful in certain specialized situations.
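To illustrate the first option, the two endpoints can be rewritten as native Tornado handlers that push the heavy work onto a thread pool. This is only a rough sketch: it assumes Tornado 4+, that train and predict return something Tornado can write (e.g. a dict), and that the futures backport is installed on Python 2:

from concurrent.futures import ThreadPoolExecutor

from tornado import gen, ioloop, web

from train_script import train
from predict_script import predict

executor = ThreadPoolExecutor(max_workers=10)

class TrainHandler(web.RequestHandler):
    @gen.coroutine
    def post(self):
        # train() runs on a worker thread; the IOLoop keeps serving
        # /predict (and other) requests in the meantime.
        response = yield executor.submit(train)
        self.write(response)

class PredictHandler(web.RequestHandler):
    @gen.coroutine
    def post(self):
        response = yield executor.submit(predict)
        self.write(response)

if __name__ == "__main__":
    app = web.Application([(r"/train", TrainHandler),
                           (r"/predict", PredictHandler)])
    app.listen(6789)
    ioloop.IOLoop.instance().start()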
I use Python (2.7.6), gevent (1.1.2), and requests (2.11.1) to make HTTP requests concurrently, and it works well. But when I add a SOCKS proxy to requests, it blocks.
This is my code:
import time
import requests
import logging
import click
import gevent
from gevent import monkey
monkey.patch_all()
FORMAT = '%(asctime)-15s %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger('test')
# socks proxy
user = MY_SOCKS_PROXY_USERNAME
password = MY_SOCKS_PROXY_PASSWORD
host = MY_SOCKS_PROXY_HOST
port = MY_SOCKS_PROXY_PORT
proxies = {
    'http': 'socks5://{0}:{1}@{2}:{3}'.format(user, password, host, port),
    'https': 'socks5://{0}:{1}@{2}:{3}'.format(user, password, host, port),
}
url = 'https://www.youtube.com/user/NBA'
def fetch_url(i, with_proxy):
    while True:
        logger.warning('thread %s fetch url' % i)
        try:
            if with_proxy:
                res = requests.get(url, proxies=proxies, timeout=5)
            else:
                res = requests.get(url, timeout=5)
        except Exception as e:
            logger.error(str(e))
            continue
        logger.warning(res.status_code)

def do_other_thing():
    while True:
        logger.warning('do other thing...')
        time.sleep(1)
@click.command()
@click.option('--with_proxy/--without_proxy', help='if use proxy', default=True)
def run(with_proxy):
    if with_proxy:
        logger.warning('with proxy......')
    else:
        logger.warning('without proxy......')
    ts = []
    ts.append(gevent.spawn(do_other_thing))
    for i in xrange(3):
        ts.append(gevent.spawn(fetch_url, i, with_proxy))
    gevent.joinall(ts)

if __name__ == '__main__':
    run()
(Screenshots show the results of two runs: one with the proxy and one without.)
With the proxy, do_other_thing blocks until fetch_url is done.
Without the proxy, it works fine (timeout errors occur because of the GFW).
Can anyone help me solve this problem? Thanks very much!
I also asked this question on GitHub, and a collaborator was kind enough to help me solve it. The fix is very simple:
moving the gevent import and monkey-patch to the very top of the file, before doing anything else
I also have a project with multiple files, and moving the import and monkey-patch to the very top of the first file solved my problem.
The issue is on GitHub.
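For reference, the corrected top of the file looks like this; the important point is that monkey.patch_all() runs before requests (and its SOCKS support) is imported:

# gevent's monkey-patching must come before any other import, so that
# the SOCKS code picks up the patched, cooperative socket module.
from gevent import monkey
monkey.patch_all()

import time
import requests
import logging
import click
import gevent

# ... the rest of the script is unchanged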
I wrote a simple server and it runs well.
Now I want to write some code that makes many POST requests to my server simultaneously, to simulate a stress test. I'm using Python.
Suppose the URL of my server is http://myserver.com.
file1.jpg and file2.jpg are the files that need to be uploaded to the server.
Here is my testing code. It uses threading and urllib2.
async_posts.py
from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2 * num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg", "rb"),
            "file2": open("path/to/file2.jpg", "rb")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)
    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()
    return res.code, end - start  # status code and duration of this request

def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target=daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But when I increase num_thread to a slightly larger number, say 10, the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know why my code goes wrong. Or maybe there is a better way to do this?
As a reference, my server is written in Python using Flask and gunicorn.
Thanks in advance.
I was trying to send HTTP POST requests to a web service in parallel. Specifically, I'd like to load test the web service under concurrent requests in terms of response time. So I plan to use threads and urllib2 to do the job: each thread carries out one HTTP request via urllib2.
Here is how I did it:
import sys
import time
import urllib2 as u2
from threading import Thread

def run_job():
    try:
        req = u2.Request(
            url = **the web service url**,
            data = **data sent to the web service**,
            headers = **http headers the web service requires**,
        )
        opener = u2.build_opener()
        u2.install_opener(opener)
        start_time = time.time()
        response = u2.urlopen(req, timeout=60)
        end_time = time.time()
        html = response.read()
        code = response.code
        response.close()
        if code == 200:
            print end_time - start_time
        else:
            print -1
    except Exception:
        print -2

if __name__ == "__main__":
    N = 1
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    threads = []
    for i in range(N):  # range(N), not range(1, N), so that N threads are created
        t = Thread(target=run_job, args=())
        threads.append(t)
    [x.start() for x in threads]
    [x.join() for x in threads]
In the meantime, I used Fiddler to capture the requests being sent out. Fiddler is a tool for composing and sending HTTP requests, and it can also capture the HTTP requests going through the host.
When I looked at Fiddler, the requests were being sent out one by one, instead of all going out at once as I expected. If what Fiddler shows is right (the requests are sent one by one), then to my knowledge there must be some queue the requests are waiting in. Could someone shed some light on what is happening behind this? And if possible, how can I make the requests truly parallel?
Also, I have put two timestamps before and after the request takes place. If requests wait in a queue after urllib2.urlopen is executed, then the delta of the two timestamps includes the time spent waiting in the queue. Is it possible to be more precise, that is, to measure the time between the request being sent out and the response being received?
Many Thanks,