Setup:
A Tornado HTTP/WebSocket server. The WebSocketHandler reacts to messages from the client (e.g. puts them in the job queue)
A beanstalk job-queue which sends jobs to the different components
Some other components communicating over beanstalk, but those are unrelated to my problem.
Problem:
The WebSocketHandler should react to jobs, but if it listens on beanstalk, it blocks. A job could be e.g. 'send data xy to client xyz'
How can this be solved nicely?
My first approach was running a job-queue listener in a separate thread which held a list of pickled WebSocketHandler instances, all stored in a redis db. Since WebSocketHandler can't be pickled (and this approach seems very ugly), I'm searching for another solution.
Any ideas?
Instead of trying to pickle your WebSocketHandler instances you could store them in a class level (or just global) dictionary.
from tornado.websocket import WebSocketHandler

class MyHandler(WebSocketHandler):
    # class-level registry of open connections, keyed by a string id
    connections = {}

    def __init__(self, *args, **kwargs):
        super(MyHandler, self).__init__(*args, **kwargs)
        self.key = str(self)
        self.connections[self.key] = self
Then you would pass self.key along with the job to beanstalk, and when you get a job back you use the key to look up which connection to send the output to, and write to it. Something like (pseudo code...):
def beanstalk_listener():
    for response in beanstalk.listen():
        # first 10 bytes are the handler key, the rest is the payload (pseudo code)
        MyHandler.connections[response.data[:10]].write_message(response.data[10:])
I don't think there is any value in trying to persist your WebSocketHandler connections in redis. They are by nature ephemeral; if your Tornado process restarts/dies they are of no use. If what you are trying to do is keep a record of which user is waiting for the output of which job, then you'll need to keep track of that separately.
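If you do need the beanstalk listener to coexist with Tornado's event loop, a minimal sketch (assuming the blocking beanstalk.listen() iterator and the 10-byte key framing from the pseudo code above) is to run the listener in a background thread and hand each message to the IOLoop, since handlers should only be written to from the IOLoop thread:

# Sketch only: background thread consumes beanstalk, the IOLoop delivers to the handler.
# beanstalk.listen() and the 10-byte key prefix are assumptions from the pseudo code.
import threading
from tornado.ioloop import IOLoop

def start_beanstalk_listener(io_loop):
    def deliver(key, payload):
        handler = MyHandler.connections.get(key)
        if handler is not None:
            handler.write_message(payload)

    def listen():
        for response in beanstalk.listen():
            key, payload = response.data[:10], response.data[10:]
            # add_callback is the thread-safe way to schedule work on the IOLoop
            io_loop.add_callback(deliver, key, payload)

    threading.Thread(target=listen, daemon=True).start()

# call once at startup, e.g. start_beanstalk_listener(IOLoop.current())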
Related
I have a basic Web API written in Node.js that writes an object as an HSET to a Redis cache. Both are running in docker containers.
I have a Python script running on the same VM which needs to watch the Redis cache and then run some code when there is a new HSET or a change to a field in the HSET.
I came across Redis Pub/Sub but I'm not sure if this is really the proper way to use it.
To test, I created two Python scripts. The first subscribes to the messaging system:
import redis
import json

print("Redis Subscriber")

redis_conn = redis.Redis(
    host='localhost',
    port=6379,
    password='xxx',
    charset="utf-8",
    decode_responses=True)

def sub():
    pubsub = redis_conn.pubsub()
    pubsub.subscribe("broadcast")
    for message in pubsub.listen():
        if message.get("type") == "message":
            data = json.loads(message.get("data"))
            print(data)

if __name__ == "__main__":
    sub()
The second publishes to the messaging system:
import redis
import json

print("Redis Publisher")

redis_conn = redis.Redis(
    host='localhost',
    port=6379,
    password='xxx',
    charset="utf-8",
    decode_responses=True)

def pub():
    data = {
        "message": "id:3"
    }
    redis_conn.publish("broadcast", json.dumps(data))

if __name__ == "__main__":
    pub()
I will rewrite the publisher in Node.js and it will simply publish the HSET key, like id:3. Then the subscriber will run in Python, and when it receives a new message it will use that key "id:3" to look up the actual hash and do stuff.
This doesn't seem like the right way to do this but Redis watch doesn't support HSET. Is there a better way to accomplish this?
This doesn't seem like the right way to do this but Redis watch doesn't support HSET.
Redis WATCH does support hash keys, though it does not support individual hash fields.
Is there a better way to accomplish this?
While I believe your approach may be acceptable for certain scenarios, pub/sub messages are fire-and-forget: your subscriber may disconnect for whatever reason right after the publisher has published a message but before it has had a chance to read it, and the notification of your object write will thus be lost forever, even if the subscriber automatically reconnects after that.
You may opt instead for Redis streams, which let you add entries to a given stream (resembling the publishing step of your code) and consume them (akin to your subscriber script) through a process which preserves the messages.
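For illustration, here is a rough sketch of the streams approach with redis-py (not something from your code); the stream name "broadcast-stream" and the field names are assumptions:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# producer side: entries are appended and kept until the stream is trimmed
r.xadd("broadcast-stream", {"message": "id:3"})

# consumer side: read anything newer than the last id we processed
last_id = "0"  # "0" starts from the beginning; persist this between runs
while True:
    for stream, messages in r.xread({"broadcast-stream": last_id}, block=5000, count=10):
        for entry_id, fields in messages:
            print(entry_id, fields)
            last_id = entry_id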
As an alternative, perhaps simpler, approach, you may just split your hashes into multiple keys, one per field, so that you can WATCH them.
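If you go down that road, the usual redis-py WATCH/MULTI pattern looks roughly like this (a sketch only; the per-field key name "id:3:message" is an assumption):

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

with r.pipeline() as pipe:
    while True:
        try:
            pipe.watch("id:3:message")      # transaction aborts if this key changes
            current = pipe.get("id:3:message")
            pipe.multi()
            pipe.set("id:3:message", (current or "") + " processed")
            pipe.execute()                  # raises WatchError if the key was touched
            break
        except redis.WatchError:
            continue                        # someone else changed the key; retry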
You might want to take a look at key-space notifications. Key-space notifications can automatically publish messages via Pub/Sub when a key is changed, added, deleted, etc.
You can choose to consume events, e.g. "HSET was called", and be given the name of the key it was called on. Or you can choose to consume keys, e.g. my:awesome:key, and be notified of which event happened. Or both.
You'll need to turn key-space notifications on in order to use them:
redis.cloud:6379> CONFIG SET notify-keyspace-events KEA
You can subscribe to all events and keys like this:
redis.cloud:6379> PSUBSCRIBE '__key*__:*'
"pmessage","__key*__:*","__keyspace#0__:foo","set"
"pmessage","__key*__:*","__keyevent#0__:set","foo"
Hope that helps!
In my application, the state of a common object is changed by making requests, and the response depends on the state.
from flask import Flask, flash, render_template

app = Flask(__name__)
app.secret_key = "dev"  # flash() stores messages in the session, which needs a key

class SomeObj():
    def __init__(self, param):
        self.param = param

    def query(self):
        self.param += 1
        return self.param

global_obj = SomeObj(0)

@app.route('/')
def home():
    flash(global_obj.query())
    return render_template('index.html')
If I run this on my development server, I expect to get 1, 2, 3 and so on. If requests are made from 100 different clients simultaneously, can something go wrong? The expected result would be that the 100 different clients each see a unique number from 1 to 100. Or will something like this happen:
Client 1 queries. self.param is incremented by 1.
Before the return statement can be executed, the thread switches over to client 2. self.param is incremented again.
The thread switches back to client 1, and the client is returned the number 2, say.
Now the thread moves to client 2 and returns him/her the number 3.
Since there were only two clients, the expected results were 1 and 2, not 2 and 3. A number was skipped.
Will this actually happen as I scale up my application? What alternatives to a global variable should I look at?
You can't use global variables to hold this sort of data. Not only is it not thread safe, it's not process safe, and WSGI servers in production spawn multiple processes. Not only would your counts be wrong if you were using threads to handle requests, they would also vary depending on which process handled the request.
Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs. If you need to load and access Python data, consider multiprocessing.Manager. You could also use the session for simple data that is per-user.
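For the counter in the question, a minimal sketch with Redis (one of the stores mentioned above; the key name is made up) would look like this, relying on the fact that INCR is atomic across all threads and processes:

import redis
from flask import Flask

app = Flask(__name__)
r = redis.Redis(host='localhost', port=6379)

@app.route('/')
def home():
    count = r.incr('request_counter')   # atomic; every worker sees a consistent sequence
    return str(count)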
The development server may run in a single thread and process, in which case you won't see the behavior you describe since each request is handled synchronously. Enable threads or processes and you will see it: app.run(threaded=True) or app.run(processes=10). (In Flask 1.0 the development server is threaded by default.)
Some WSGI servers may support gevent or another async worker. Global variables are still not thread safe because there's still no protection against most race conditions. You can still have a scenario where one worker gets a value, yields, another modifies it, yields, then the first worker also modifies it.
If you need to store some global data during a request, you may use Flask's g object. Another common case is some top-level object that manages database connections. The distinction for this type of "global" is that it's unique to each request, not used between requests, and there's something managing the set up and teardown of the resource.
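A short sketch of that per-request pattern with g, following the usual Flask docs recipe (the sqlite file name is arbitrary):

import sqlite3
from flask import Flask, g

app = Flask(__name__)

def get_db():
    # created at most once per request, stored on g
    if 'db' not in g:
        g.db = sqlite3.connect('app.db')
    return g.db

@app.teardown_appcontext
def close_db(exc):
    # g (and anything on it) is discarded when the request ends
    db = g.pop('db', None)
    if db is not None:
        db.close()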
This is not really an answer to thread safety of globals.
But I think it is important to mention sessions here.
You are looking for a way to store client-specific data. Every connection should have access to its own pool of data, in a threadsafe way.
This is possible with server-side sessions, and they are available in a very neat flask plugin: https://pythonhosted.org/Flask-Session/
If you set up sessions, a session variable is available in all your routes and it behaves like a dictionary. The data stored in this dictionary is individual for each connecting client.
Here is a short demo:
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
# Check Configuration section for more details
SESSION_TYPE = 'filesystem'
app.config.from_object(__name__)
Session(app)

@app.route('/')
def reset():
    session["counter"] = 0
    return "counter was reset"

@app.route('/inc')
def routeA():
    if "counter" not in session:
        session["counter"] = 0
    session["counter"] += 1
    return "counter is {}".format(session["counter"])

@app.route('/dec')
def routeB():
    if "counter" not in session:
        session["counter"] = 0
    session["counter"] -= 1
    return "counter is {}".format(session["counter"])

if __name__ == '__main__':
    app.run()
After pip install Flask-Session, you should be able to run this. Try accessing it from different browsers, you'll see that the counter is not shared between them.
Another example of a data source external to requests is a cache, such as what's provided by Flask-Caching or another extension.
Create a file common.py and place in it the following:
from flask_caching import Cache
# Instantiate the cache
cache = Cache()
In the file where your flask app is created, register your cache with the following code:
# Import cache
from pathlib import Path

from flask import Flask

from common import cache

# ...
app = Flask(__name__)
cache.init_app(app=app, config={"CACHE_TYPE": "filesystem", "CACHE_DIR": Path("/tmp")})
Now use throughout your application by importing the cache and executing as follows:
# Import cache
from common import cache
# store a value
cache.set("my_value", 1_000_000)
# Get a value
my_value = cache.get("my_value")
While totally accepting the previous upvoted answers, and discouraging the use of global variables for production and scalable Flask storage, for the purpose of prototyping or really simple servers running under the Flask 'development server'...
...
Python's built-in data types (I personally used and tested the global dict) are, per the Python documentation, thread safe. They are not process safe.
Insertions, lookups, and reads from such a (server-global) dict will be OK from each (possibly concurrent) Flask session running under the development server.
When such a global dict is keyed with a unique Flask session key, it can be rather useful for server-side storage of session-specific data that otherwise does not fit into the cookie (max size 4 kB).
Of course, such a server-global dict should be carefully guarded against growing too large, since it lives in memory. Some sort of expiry of the 'old' key/value pairs can be coded during request processing.
Again, it is not recommended for production or scalable deployments, but it is possibly OK for local task-oriented servers where a separate database is too much for the given task.
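For completeness, a minimal sketch of the idea described above (only for the single-process development server; the uuid-based session key and dict name are my own choices):

import uuid
from flask import Flask, session

app = Flask(__name__)
app.secret_key = 'dev-only'            # the cookie session needs a key

server_side_store = {}                 # server-global dict; lives only as long as the process

def client_store():
    # key the global dict with a unique id kept in the client's cookie session
    if 'sid' not in session:
        session['sid'] = uuid.uuid4().hex
    return server_side_store.setdefault(session['sid'], {})

@app.route('/remember/<value>')
def remember(value):
    client_store()['last_value'] = value
    return 'stored'

@app.route('/recall')
def recall():
    return client_store().get('last_value', 'nothing stored yet')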
...
I have an Elastic Beanstalk application which is running a web server environment and a worker tier environment. My goal is to pass some parameters to an endpoint in the web server which submits a request to the worker which will then go off and do a long computation and write the results to an S3 bucket. For now I'm ignoring the "long computation" part and just writing a little hello world application which simulates the workflow. Here's my Flask application:
from flask import Flask, request
import boto3
import json

application = Flask(__name__)

@application.route("/web")
def test():
    data = json.dumps({"file": request.args["file"], "message": request.args["message"]})
    boto3.client("sqs").send_message(
        QueueUrl = "really_really_long_url_for_the_workers_sqs_queue",
        MessageBody = data)
    return data

@application.route("/worker", methods = ["POST"])
def worker():
    data = request.get_json()
    boto3.resource("s3").Bucket("myBucket").put_object(Key = data["file"], Body = data["message"])
    return data["message"]

if __name__ == "__main__":
    application.run(debug = True)
(Note that I changed the worker's HTTP Path from the default / to /worker.) I deployed this application to both the web server and the worker, and it does exactly what I expected. Of course, I had to do the usual IAM configuration.
What I don't like about this is the fact that I have to hard code my worker's SQS URL into my web server code. This makes it more complicated to change which queue the worker polls, and more complicated to add additional workers, both of which will be convenient in production. I would like some code which says "send this message to whatever queue worker X is currently polling". It's obviously not a huge deal, but I thought I would see if anyone knows a way to do this.
Given the nature of the queue URLs, you may want to try keeping them in some external storage (an in-memory database or key-value store, perhaps) that associates the URLs with the IDs of the workers currently using them. That way you can update them as needed without having to modify your application. (The downside would be that you then have an additional data source to maintain, and you'd need to write the interfacing code for both the server and the workers.)
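A rough sketch of that idea, assuming a Redis instance reachable from both environments and a made-up key layout (none of this comes from the Elastic Beanstalk APIs themselves):

import json
import boto3
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)
sqs = boto3.client('sqs')

# worker side, at deploy/startup: register the queue this worker polls
def register_worker(worker_id, queue_url):
    r.hset('worker_queues', worker_id, queue_url)

# web server side: resolve the URL by worker id at request time
def send_job(worker_id, payload):
    queue_url = r.hget('worker_queues', worker_id)
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))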
I'm having a problem using the Perspective Broker feature of Twisted Python. The structure of my code is like this:
from twisted.application import service
from twisted.spread import pb

class DBService(service.Service):
    def databaseOperation(self, input):
        # insert input into DB (placeholder)
        pass

class PerspectiveRoot(pb.Root):
    def __init__(self, service):
        self.service = service

    def remote_databaseOperation(self, input):
        return self.service.databaseOperation(input)

db_service = DBService()
pb_factory = pb.PBServerFactory(PerspectiveRoot(db_service))
I hook up the factory to a TCP server and then multiple clients connect, who are able to insert records into the database via the remote_databaseOperation function.
This works fine until the number of requests gets large, at which point I end up with duplicate and missing inputs. I assume this is because DBService's input variable persists and gets overwritten during simultaneous requests. Is this correct? And if so, what's the best way to rewrite my code so it can deal with simultaneous requests?
My first thought is to have DBService maintain a list of DB additions and loop through it, while clients are able to append to the list. Is this the most 'Twisted' way to do it?
Alternatively, is there a separate pb.Root instance for every client? In which case, I could move the database operation into there since its variables won't get overwritten.
I have a Django web application. I also have a spell server written using Twisted running on the same machine as Django (on localhost:8090). The idea is that when the user does some action, a request comes to Django, which in turn connects to this Twisted server, and the server sends data back to Django. Finally Django puts this data in some HTML template and serves it back to the user.
Here's where I am having a problem. In my Django app, when the request comes in I create a simple twisted client to connect to the locally run twisted server.
...
factory = Spell_Factory(query)
reactor.connectTCP(AS_SERVER_HOST, AS_SERVER_PORT, factory)
reactor.run(installSignalHandlers=0)
print factory.results
...
The reactor.run() is causing a problem, since it's an event loop: the next time this same code is executed by Django, I am unable to connect to the server. How does one handle this?
The above two answers are correct. However, considering that you've already implemented a spelling server, run it as one. You can start by running it on the same machine as a separate process, at localhost:PORT. It seems you already have a very simple binary protocol interface, so you can implement an equally simple Python client using the standard library's socket interface in blocking mode.
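A blocking client along those lines might look like this (a sketch only; the newline-delimited framing is a guess, since your spell server's protocol isn't shown):

import socket

AS_SERVER_HOST = '127.0.0.1'
AS_SERVER_PORT = 8090

def spell_check(query, timeout=5.0):
    # plain blocking stdlib socket; no reactor involved
    with socket.create_connection((AS_SERVER_HOST, AS_SERVER_PORT), timeout=timeout) as sock:
        sock.sendall(query.encode('utf-8') + b'\n')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8')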
However, I suggest playing around with twisted.web and exposing a simple web interface. You can use JSON to serialize and deserialize data, which is well supported by Django. Here's a very quick example:
import json

from twisted.web import server, resource
from twisted.python import log

class Root(resource.Resource):
    def getChild(self, path, request):
        # represents / on your web interface
        return self

class WebInterface(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        log.msg('GOT a GET request.')
        # read request.args if you need to process query args
        # ... call some internal service and get output ...
        output = {'result': 'placeholder'}  # placeholder so the example runs; replace with your service call
        return json.dumps(output)

class SpellingSite(server.Site):
    def __init__(self, *args, **kwargs):
        self.root = Root()
        server.Site.__init__(self, self.root, **kwargs)
        self.root.putChild('spell', WebInterface())
And to run it you can use the following skeleton .tac file:
from twisted.application import service, internet
# SpellingSite needs to be importable here, e.g.:
# from spellserver import SpellingSite
PORT = 8090

site = SpellingSite()
application = service.Application('WebSpell')
# attach the service to its parent application
service_collection = service.IServiceCollection(application)
internet.TCPServer(PORT, site).setServiceParent(service_collection)
Running your spelling service as a first-class service of its own allows you to run it on another machine one day if you find the need. Exposing a web interface also makes it easy to scale horizontally behind a reverse-proxying load balancer.
reactor.run() should be called only once in your whole program. Don't think of it as "start this one request I have", think of it as "start all of Twisted".
Running the reactor in a background thread is one way to get around this; your Django application can then use blockingCallFromThread and call into any Twisted API as it would a blocking API. You will need a little cooperation from your WSGI container, though, because you need to make sure the background Twisted thread is started and stopped at appropriate times (when your interpreter is initialized and torn down, respectively).
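A minimal sketch of that arrangement (get_spelling_result here is a stand-in for whatever Twisted API returns your result as a Deferred):

import threading
from twisted.internet import reactor
from twisted.internet.threads import blockingCallFromThread

def start_reactor_once():
    # run Twisted's loop in the background; no signal handlers off the main thread
    threading.Thread(
        target=reactor.run,
        kwargs={'installSignalHandlers': False},
        daemon=True,
    ).start()

# during WSGI startup:
# start_reactor_once()

# inside a Django view (which runs in a WSGI worker thread):
# result = blockingCallFromThread(reactor, get_spelling_result, query)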
You could also use Twisted as your WSGI container, and then you don't need to start or stop anything special; blockingCallFromThread will just work immediately. See the command-line help for twistd web --wsgi.
You would have to stop the reactor after you get results from the Twisted server, or after some error/timeout. So on each Django request that needs to query your Twisted server you would run the reactor and then stop it. But that's not supported by the Twisted library: the reactor is not restartable. Possible solutions:
Use a separate thread for the Twisted reactor, but you will need to deploy your Django app with a server that supports long-running threads (I don't know of any offhand, but you could write your own easily :-)).
Don't use Twisted for implementing client protocol, just use plain stdlib's socket module.