I need a way to manage connections to a hosted Elastic Search provider, to speed up search on my website. We are running Django on Heroku, using the Found ElasticSearch add-on, and pyes, which is an ElasticSearch Python library.
The standard way of setting up a connection to ElasticSearch with pyes is by passing the provider URL into an ES object, like so:
(1) connection = ES(my_elasticsearch_url)
Pyes uses the ES object behind the scenes to establish an open HTTP connection to my ElasticSearch provider, so I can run searches like this:
(2) results = connection.search(some_query, index_name)
Previously, I was doing both those steps in my Django view for search -- every time a user did a search, it opened a new HTTP connection then ran the search. Consequentially the search call was slow.
I sped up search by moving (1) into my app's __init__.py file -- now, I am setting up the connection only once, and importing it into the search view. But I'm worried it will choke that HTTP connection if lots of people are trying search at once.
I'm looking for ideas on how to set up a pool of connections, initiate them once on app start up, and then dole them out to my search view as needed. Ideally I'd like to be able to scale the size of the pool up and down easily with minimal changes to my code.
I can think of a few ways to approach it, but it seems like a common computing related problem, so I'm sure that a lot of you have ideas on good design and best practices for such a system. I'd love to hear them.
Thanks a lot!
Clay
If your running in a multi-threaded environment, it's merely a matter of extending Queue.Queue to create an instance that can fetch and instantiate connections on demand, from multiple threads in which your views are handling the request-response flow. You'll probably want to have a certain cap on how many connections your retaining by limiting the maximum size of the queue, although you can instantiate more connections beyond that and simply discard them if you can put them back into the queue.
The downside of using Queue.Queue is that it can create cross-cutting concerns if your views are responsible for retrieving connections from and returning them back into the queue. You can get a healthier design if you only queue the actual object from pyes.ES that holds the connection and create a wrapper for ES that, when performing a query, creates a new ES instance, fetches a connection from the queue, sets it on the instance, performs the query, returns the connection back into the queue, discards the ES instance and returns the query results.
Related
I am working on scaling out a webapp and providing some database redundancy for protection against failures and to keep the servers up when updates are needed. The app is still in development, so I have chosen a simple multi-master redundancy with two separate database servers to try and achieve this. Each server will have the Django code and host its own database, and the databases should be as closely mirrored as possible (updated within a few seconds).
I am trying to figure out how to set up the multi-master (master-master) replication between databases with Django and MySQL. There is a lot of documentation about setting it up with MySQL only (using various configurations), but I cannot find any for making this work from the Django side of things.
From what I understand, I need to approach this by adding two database entries in the Django settings (one for each master) and then write a database router that will specify which database to read from and which to write from. In this scenario, both databases should accept both reads and writes, and writes/updates should be mirrored over to the other database. The logic in the router could simply use a round-robin technique to decide which database to use. From there on, further configuration to set up the actual replication should be done through MySQL configuration.
Does this approach sound correct, and does anyone have any experience with getting this to work?
Your idea of the router is great! I would add that you need automatically detect whether a databases is [slow] down. You can detect that by the response time and by connection/read/write errors. And if this happens then you exclude this database from your round-robin list for a while, trying to connect back to it every now and then to detect if the databases is alive.
In other words the round-robin list grows and shrinks dynamically depending on the health status of your database machines.
The another important notice is that luckily you don't need to maintain this round-robin list common to all the web servers. Each web server can store its own copy of the round-robin list and its own state of inclusion and exclusion of databases into this list. This is just because a database server can be seen from one web server and can be not seen from another one due to local network problems.
We are using multiple cassandra datastax cluster instances(6) to connect to cassandra using python. We are pooling these multiple connections to do some operations. Each operation is independent of other.
It works fine on a small number of operations, but once I try to scale up I get the following errors :
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.ption('Pool is shutdown',)})
and sometimes the following warning:
WARNING Heartbeat failed for connection (140414695068880) to 127.0.0.1
I tried changing some cluster object parameters but it did not help.
Following is the configuration of key space in cassandra I am using :
'class': 'SimpleStrategy',
'replication_factor': '1'
I am using lastest versions of cassandra and datastax driver for python. There is only one node is cassandra.
EDIT: More details:
The multiple cluster instances are in different processes (processes are created using the Python multiprocessing module) - one cluster instance per process. Lets call the proccesses Cassandra-Processes (CP). There are a bunch of other process that do some computation and need to look up a Cassandra DB, and write to it, occassionally. The current design is that each of these processes is mapped to one CP, and all DB reads/writes to be done by the process is done via this mapped CP. 'what' exactly is to be read/written is passed into a queue (again from the multiprocessing library) which the mapped CP reads.
We observe that this setup runs for quite sometime - and then suddenly Cassandra begins erroring out.
It's unclear why you're using six cluster instances against a single Cassandra node. Generally, you should use one Cluster instance per application (per remote cluster). You can read about general design considerations for Cassandra drivers here
If you're looking to "scale" with regards to throughput, you might consider using multiprocessing. I discuss this in a blog post here.
Follow-on:
Two things can be inferred from the information we have so far:
The application is pushing more concurrent requests than your connection pool is configured to handle. I say this because the "Pool is shutdown" only occurs when a request is waiting for a connection/stream to become available. You can tune connection pooling to make more available initially using cluster settings. However, if your "cluster" (server node) is overwhelmed, you won't gain much there.
Your connection is being shutdown. This exception only happens when the node is suddenly marked down. In a single node setup this is most likely because of a connection error. Look for clues in the server log, or driver debug log if you're capturing that.
We probably need to know more about your execution model to help more. Is it possible you're running unfettered async requests without occasionally waiting for them to complete?
Remote diagnosis is hard to do without knowing anything on your specific topology, setup and system configuration. This however looks much like a configuration problem or even the python driver. If you google your error message you will find multiple topics on Datastax's Jira describing this or similar problems, I would check that the Python Driver is up to date.
What would help in the first place is to see in detail what you try to do, how your cluster is configured aso.
I'm working on a web application related to genome searching. This application makes use of this suffix tree library through Cython bindings. Objects of this type are large (hundreds of MB up to ~10GB) and take as long to load from disk as it takes to process them in response to a page request. I'm looking for a way to load several of these objects once on server boot and then use them for all page requests.
I have tried using a remote manager / client setup using the multiprocessing module, modeled after this demo, but it fails when the client connects with an error message that says the object is not picklable.
I would suggest writing a small Flask (or even raw WSGI… But it's probably simpler to use Flask, as it will be easier to get up and running quickly) application which loads the genome database then exposes a simple API. Something like this:
app = Flask(__name__)
database = load_database()
#app.route('/get_genomes')
def get_genomes():
return database.all_genomes()
app.run(debug=True)
Or, you know, something a bit more sensible.
Also, if you need to be handling more than one request at a time (I believe that app.run will only handle one at a time), start by threading… And if that's too slow, you can os.fork() after the database is loaded and run multiple request handlers from there (that way they will all share the same database in memory).
I'm just wondering if Django was designed to be a fully stateless framework?
It seems to encourage statelessness and external storage mechanisms (databases and caches) but I'm wondering if it is possible to store some things in the server's memory while my app is in develpoment and runs via manage.py runserver.
Sure it's possible. But if you are writing a web application you probably won't want to do that because of threading issues.
That depends on what you mean by "store things in the server's memory." It also depends on the type of data. If you can, you're better off storing "global data" in a database or in the file system somewhere. Unless it is needed every request it doesn't really make sense to store it in the Django instance itself. You'll need to implement some form of locking to prevent race conditions, but you'd need to worry about race conditions if you stored everything on the server object anyway.
Of course, if you're talking about user-by-user data, Django does support sessions. Or, and this is another perfectly good option if you're willing to make the user save the data, cookies.
The best way to maintain state in a django app on a per-user basis is request.session (see django sessions) which is a dictionary you can use to remember things about the current user.
For Application-wide state you should use the a persistent datastore (database or key/value store)
example view for sessions:
def my_view(request):
pages_viewed = request.session.get('pages_viewed', 1) + 1
request.session['pages_viewed'] = pages_viewed
...
If you wanted to maintain local variables on a per app-instance basis you can just define module level variables like so
# number of times my_view has been served since by this server
# instance since the last restart
served_since_restart = 0
def my_view(request):
served_since_restart += 1
...
If you wanted to maintain some server state across ALL app servers (like total number of pages viewed EVER) you should probably use a persistent key/value store like redis, memcachedb, or riak. There is a decent comparison of all these options here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
You can do it with redis (via redis-py) like so (assuming your redis server is at "127.0.0.1" (localhost) and it's port 6379 (the default):
import redis
def my_view(request):
r = redis.Redis(host='127.0.0.1', port="6379")
served = r.get('pages_served_all_time', 0)
served += 1
r.set('pages_served_all_time', served)
...
There is LocMemCache cache backend that stores data in-process. You can use it with sessions (but with great care: this cache is not cross-process so you will have to use single process for deployment because it will not be guaranteed that subsequent requests will be handled by the same process otherwise). Global variables may also work (use threadlocals if they shouldn't be shared for all process threads; the warning about cross-process communication also applies here).
By the way, what's wrong with external storage? External storage provides easy cross-process data sharing and other features (like memory limiting algorithms for cache or persistance with databases).
I am using a Python module (PyCLIPS) and Django 1.3.
I want develop a thread-safety class which realizes the Object Pool and the Singleton patterns and also that have to be shared between requests in Django.
For example, I want to do the following:
A request gets the object with some ID from the pool, do
something with it and push it back to the pool, then send response
with the object's ID.
Another request, that has the object's ID, gets
the object with the given ID from the pool and repeats the steps from the above request.
But the state of the object will has to be kept while it'll be at the pool while the server is running.
It should be like a Singleton Session Bean in Java EE
How I should do it? Is there something I'll should read?
Update:
I can't store objects from the pool in a database, because these objects are wrappers under a library written on C-language which is API for the Expert System Engine CLIPS.
Thanks!
Well, I think a different angle is necessary here. Django is not like Java, the solution should be tailored for a multi-process environment, not a multi-threaded one.
Django has no immediate equivalent of a singleton session bean.
That said, I see no reason your description does not fit a classic database model. You want to save per object data, which should always go in the DB layer.
Otherwise, you can always save stuff on the session, which Django provides for both logged-in users as well as for anonymous ones - see the docs on Django sessions.
Usage of any other pattern you might be familiar with from a Java environment will ultimately fail, considering the vast difference between running a Java web container, and the Python/Django multi-process environment.
Edit: well, considering these objects are not native to your app rather accessed via a third-party library, it does complicate things. My gut feeling is that these objects should not be handled by the web layer but rather by some sort of external service which you can access from a multi-process environment. As Daniel mentioned, you can always throw them in the cache (if said objects are pickle-able). But it feels as if these objects do not belong in the web tier.
Assuming the object cannot be pickled, you will need to create an app to manage the object and all of the interactions that need to happen against it. Probably the easiest implementation would be to create a single process wsgi app (on a different port) that exposes an api to do all of the operations that you need. Whether you use a RESTful api or form posts is up to your personal preference.
Are these database objects? Because if so, the db itself is really the pool, and there's no need to do anything special - each request can independently load the instance from the db, modify it, and save it back.
Edit after comment Well, the biggest problem is that a production web server environment is likely to be multi-process, so any global variables (ie the pool) are not shared between processes. You will need to store them somewhere that's globally accessible. A short in the dark, but are they serializable using Pickle? If so, then perhaps memcache might work.