How to properly handle database connections? - python

I would like to optimize my system so that it can handle a large number of users down the road. Even if the website never becomes popular, I want to do things right.
Anyway, I am currently using a combo of 2 database solutions:
1.) Either SQL (MySQL, PostgreSQL) via SQLAlchemy, or MongoDB
2.) Redis
I use Redis as the 'hot' database (as it's much, much faster and takes stress off the primary database), and then sync data between the two via cron tasks. I use Redis for session management, statistics, etc. However, if my Redis server crashed, the site would remain operational (falling back to SQL/Mongo).
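Roughly, the 'hot with fallback' reads look like this in my code (a simplified sketch using redis-py and SQLAlchemy; the table and key names are made up):

    import json

    import redis
    from sqlalchemy import create_engine, text

    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    engine = create_engine('mysql+pymysql://user:secret@localhost/mydb')

    def get_session_data(session_id):
        # Try the 'hot' store first, fall back to the primary DB if Redis is down.
        try:
            cached = r.get('session:%s' % session_id)
            if cached is not None:
                return json.loads(cached)
        except redis.RedisError:
            pass                    # Redis unreachable: the site must stay up
        with engine.connect() as conn:
            row = conn.execute(text("SELECT data FROM sessions WHERE id = :id"),
                               {"id": session_id}).fetchone()
        return json.loads(row[0]) if row else None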
So this is my design for data. Now I would like to do proper connecting.
As both sql/mongo and redis are required on 99% of pages, my current design is the following:
- When new HTTP request comes in, I connect to all databases
- When page finishes rendering, I disconnect from databases
Now obviously I am doing a lot of connecting/disconnecting. I've calculated that this model could sustain a decent number of visitors, but I am wondering if there is a better way to do it.
Would persisting connections between requests improve performance, or would the sheer number of open connections clog the server?
Would you recommend creating a connection pool? If so, when should the connection pool be created and how should the Model access it (or fetch connection objects from it).
I am sorry if these questions are stupid, but I am a newbie.

I don't think it's a good idea to optimize things beforehand. You don't know where the bottlenecks will appear, and you will probably just waste time on things you won't need in the future.
The database type can be changed later if you use an ORM, so right now you can use any of them. In any case, if your site's popularity rises, you will need to get more servers, add task queues (Celery), etc. There are tons of things you can do later to optimize. Right now you should just focus on making your site popular and on using technologies that can scale in the future.

If you are going to leave connections open, you should definitely consider pooling to avoid crudding up the system with per-session connections or something of the like (as long as they are locked properly to avoid leaking). That said, the necessity of doing this isn't clear. If you can quantify the system with some average/worst-case connection times to the databases, you'd be able to make a much more informed decision.
Try running a script (or scripts) to hammer your system and investigate DB-related timing. This should help you make an immediate decision about whether to keep persistent connections, and it leaves you with a handy DB load script for later on.
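If you do go the pooling route, note that SQLAlchemy and redis-py already ship with pools, so the usual pattern is to configure them once at application startup and let your models share the engine/pool (a sketch; the URLs and sizes are placeholders):

    import redis
    from sqlalchemy import create_engine

    # Created once at application startup, shared by all requests.
    engine = create_engine(
        'mysql+pymysql://user:secret@localhost/mydb',
        pool_size=10,        # persistent connections kept open
        max_overflow=20,     # extra connections allowed under load
        pool_recycle=3600,   # recycle before MySQL's wait_timeout kills them
    )
    redis_pool = redis.ConnectionPool(host='localhost', port=6379, db=0)

    def get_redis():
        # Cheap: the client borrows a connection from the shared pool.
        return redis.StrictRedis(connection_pool=redis_pool)

    # In a model / per request:
    # with engine.connect() as conn:
    #     conn.execute(...)   # the connection goes back to the pool on exit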


Is it possible to cherry-pick the (persistent) DB connection Django will use on each request, on a per-client/session basis?

I'd like to be able to tell someone (Django, pgBouncer, or whoever can provide me with this service) to always hand me the same connection to the database (PostgreSQL in this case) on a per-client/session basis, instead of getting a random one each time (or creating a new one, for that matter).
To my knowledge:
Django's CONN_MAX_AGE can control the lifetime of connections, so far so good. This will also have a positive impact on performance (no connection setup penalties).
Some pooling package (pgBouncer for example) can hold the connections and hand them to me as I need them. We're almost there.
The only bit I'm missing is the possibility to ask pgBouncer (or any other similar tool for that matter) to give me a specific db connection, instead of "one from the pool". This is important because I want to have control over the lifetime of the transaction. I want to be able to open a transaction, then send a series of commands, and then manually commit all the work, or roll everything back should something fail.
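For reference, the persistent-connection part above is just configuration; this is roughly what I have (a sketch; the host/port point at pgBouncer and the credentials are placeholders). It still doesn't give me control over which connection I get:

    # settings.py (sketch)
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'mydb',
            'USER': 'myuser',
            'PASSWORD': 'secret',
            'HOST': '127.0.0.1',    # pgBouncer, not PostgreSQL directly
            'PORT': '6432',
            'CONN_MAX_AGE': 600,    # keep connections alive between requests
        }
    }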
Many years ago, I implemented something very similar to what I'm looking for now. It was a simple connection pool written in C which, on the one hand, would hold as many connections to Oracle as clients needed, while on the other it would give these clients the chance to recover those exact connections based on some ID, which could be, for example, a PHP session ID. That way users could acquire a lock on some database object/row, and the lock would persist even after the Apache process died. From that point on, the session owner was in total control of that row until they decided it was time to commit it, or until the backend decided it was time to let the transaction go due to idleness.

CherryPy with embedded database (SQLite)

I am developing an application that uses CherryPy. Now I need to implement a database, and I would very much like it to be embedded with the app, to save users some headache. The obvious first choice is of course SQLite, seeing how it's part of the standard library.
There seem to be a lot of different takes on this: some say you should never use SQLite in a threaded application, some say it's fine, and the estimates of how many writes per second I can expect vary wildly.
Is using SQLite in this way viable, and how slow can I expect writes to the database to be?
If it is viable, what is the best method of implementing it? Subscribing a connection to each start_thread? Opening a connection every time a page is requested, as some seem to do?
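The start_thread variant I have in mind would look roughly like this (a sketch; 'my.db' and the handler are just placeholders):

    import sqlite3

    import cherrypy

    def connect(thread_index):
        # One connection per worker thread, kept in CherryPy's thread-local data.
        cherrypy.thread_data.db = sqlite3.connect('my.db')

    cherrypy.engine.subscribe('start_thread', connect)

    class App(object):
        @cherrypy.expose
        def index(self):
            cur = cherrypy.thread_data.db.execute("SELECT count(*) FROM hits")
            return str(cur.fetchone()[0])

    cherrypy.quickstart(App())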
I've read that turning PRAGMA synchronous=OFF in SQLite can improve performance at the cost of "if you lose power in the middle of a transaction, your database file might go corrupt." What are the probabilities here? Is this an acceptable choice perhaps in conjunction with some sort of backup system?
Are there any other embedded databases that would be a better choice?
Should I just give up on this and use a PostgreSQL database at the cost of user convenience?
Thanks in advance.

How to ensure several Python processes access the database one by one?

I have a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they run forever;
they have no time constraint to finish their job;
they all access the same MySQL DB, writing and reading.
As they accumulate, they are starting to slow down the website, which runs on the same system but depends on these scripts.
I can use queues with Kombu to funnel all the writes through one place.
But do you know a way to do the same with reads?
E.g.: if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer? This way everybody makes requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level?
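To make the idea concrete, this is the kind of thing I imagine (a rough sketch with multiprocessing; run_query is a stand-in for the real MySQL code):

    import multiprocessing

    def run_query(sql):
        return "result of: %s" % sql        # stand-in for the real MySQL code

    def db_worker(request_queue, reply_queues):
        # The only process that ever talks to the DB, one request at a time.
        while True:
            item = request_queue.get()
            if item is None:
                break
            client, sql = item
            reply_queues[client].put(run_query(sql))

    def script(name, request_queue, reply_queue):
        # Each scraper/checker/cleaner sends its query and blocks for the answer.
        request_queue.put((name, "SELECT 1  -- issued by %s" % name))
        print(name, reply_queue.get())

    if __name__ == '__main__':
        requests = multiprocessing.Queue()
        replies = {n: multiprocessing.Queue() for n in ('scraper', 'checker')}
        worker = multiprocessing.Process(target=db_worker, args=(requests, replies))
        worker.start()
        clients = [multiprocessing.Process(target=script, args=(n, requests, replies[n]))
                   for n in replies]
        for p in clients:
            p.start()
        for p in clients:
            p.join()
        requests.put(None)
        worker.join()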
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know what DB you use, but for Postgres I'm using pgBouncer for similar reasons, see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1 GB, so the problem is CPU-bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often. MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB. You could re-write the scripts so that they must call that function rather than talk directly to the DB. Within that function, you could have a scope-based lock so that only one process would be talking to the DB at a time.
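A sketch of what I mean, using a file lock so it works across separate processes (the lock file path is arbitrary, and it only serializes processes on the same machine, which matches your setup):

    import contextlib
    import fcntl

    LOCKFILE = '/tmp/db.lock'       # any path all the scripts can reach

    @contextlib.contextmanager
    def db_access():
        # Only one process on the machine can hold this lock at a time.
        with open(LOCKFILE, 'w') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    # In each script, instead of talking to MySQL directly:
    # with db_access():
    #     cursor.execute("UPDATE ...")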

Need suggestion about MMORPG data model design, database access and stackless python [closed]

I'm developing a turn-based casual MMORPG game server. The low-level engine (NOT written by us), which handles networking, multi-threading, timers, inter-server communication, the main game loop, etc., was written in C++. The high-level game logic is written in Python.
My question is about the data model design in our game.
At first we simply tried to load all of a player's data into RAM and a shared data cache server when the client logs in, schedule a timer to periodically flush data to the data cache server, and have the data cache server persist the data into the database.
But we found this approach has some problems:
1) Some data needs to be saved or checked instantly, such as quest progress, level-ups, item & money gains, etc.
2) According to the game logic, we sometimes need to query an offline player's data.
3) Some global game-world data needs to be shared between different game instances, which may be running on different hosts or in different processes on the same host. This is the main reason we need a data cache server sitting between the game logic servers and the database.
4) Players need to switch freely between game instances.
Below are the difficulties we encountered in the past:
1) All data access operations should be asynchronous to avoid network I/O blocking the main game logic thread. We have to send a message to the database or cache server, handle the reply message in a callback function, and then continue the game logic. It quickly becomes painful to write moderately complex game logic that needs to talk to the DB several times, and the game logic gets scattered across many callback functions, which makes it hard to understand and maintain.
2) The ad-hoc data cache server makes things more complex; it is hard for us to maintain data consistency and to update/load/refresh data efficiently.
3) In-game data queries are inefficient and cumbersome; the game logic needs to query a lot of information such as inventory, item info, avatar state, etc. Some transaction mechanism is also needed; for example, if one step fails, the entire operation should be rolled back. We tried to design a good data model in RAM, building a lot of complex indexes to ease the numerous queries, adding transaction support, etc. I quickly realized that what we were building was an in-memory database system; we were reinventing the wheel...
Finally I turned to Stackless Python and we removed the cache server. All data is saved in the database, and the game logic server queries the database directly. With Stackless Python's micro-tasklets and channels, we can write game logic in a synchronous style. It is far easier to write and understand, and productivity has greatly improved.
In fact, the underlying DB access is still asynchronous: a client tasklet issues a request to a dedicated DB I/O worker thread and blocks on a channel, but the main game logic is not blocked; other clients' tasklets are scheduled and run freely. When the DB replies, the blocked tasklet is woken up and continues to run from the 'break point' (a continuation?).
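Roughly, the pattern looks like this (a toy sketch; here the 'DB worker' is just another tasklet faking the query so the channel mechanics are visible, whereas in the real server it hands the request to the DB I/O thread):

    import stackless

    def db_worker(requests, n_requests):
        # Stand-in for the dedicated DB I/O worker: answers a fixed number
        # of requests, pretending to hit the database for each one.
        for _ in range(n_requests):
            query, reply_channel = requests.receive()
            reply_channel.send("rows for: %s" % query)

    def game_logic(requests, player):
        # Client tasklet: reads like straight-line, blocking code, but only
        # this tasklet sleeps on the channel; all other tasklets keep running.
        reply_channel = stackless.channel()
        requests.send(("SELECT * FROM inventory WHERE player='%s'" % player,
                       reply_channel))
        rows = reply_channel.receive()
        print(player, rows)

    players = ('alice', 'bob')
    requests = stackless.channel()
    stackless.tasklet(db_worker)(requests, len(players))
    for name in players:
        stackless.tasklet(game_logic)(requests, name)
    stackless.run()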
With the above design, I have some questions:
1) DB access will be more frequent than with the previous cached solution. Can the DB support such a high rate of query/update operations? Will a mature cache solution such as Redis or memcached be needed in the near future?
2) Are there any serious pitfalls in my design? Can you give me some better suggestions, especially on the in-game data management pattern?
Any suggestions would be appreciated, thanks.
I've worked with one MMO engine that operated in a somewhat similar fashion. It was written in Java, however, not Python.
With regards to your first set of points:
1) async db access We actually went the other route, and avoided having a “main game logic thread.” All game logic tasks were spawned as new threads. The overhead of thread creation and destruction was completely lost in the noise floor compared to I/O. This also preserved the semantics of having each “task” as a reasonably straightforward method, instead of the maddening chain of callbacks that one otherwise ends up with (although there were still cases of this.) It also meant that all game code had to be concurrent, and we grew increasingly reliant upon immutable data objects with timestamps.
2) ad-hoc cache We employed a lot of WeakReference objects (I believe Python has a similar concept?), and also made use of a split between the data objects, e.g. “Player”, and the “loader” (actually database access methods) e.g. “PlayerSQLLoader;” the instances kept a pointer to their Loader, and the Loaders were called by a global “factory” class that would handle cache lookups versus network or SQL loads. Every “Setter” method in a data class would call the method changed, which was an inherited boilerplate for myLoader.changed (this);
In order to handle loading objects from other active servers, we employed “proxy” objects that used the same data class (again, say, “Player,”) but the Loader class we associated was a network proxy that would (synchronously, but over gigabit local network) update the “master” copy of that object on another server; in turn, the “master” copy would call changed itself.
Our SQL UPDATE logic had a timer. If the backend database had received an UPDATE of the object within the last ($n) seconds (we typically kept this around 5), it would instead add the object to a “dirty list.” A background timer task would periodically wake and attempt to flush any objects still on the “dirty list” to the database backend asynchronously.
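In Python terms, the deferred half of that boils down to something like this (a rough sketch of the idea, not our actual code; save_to_db stands in for the real UPDATE):

    import threading
    import time

    FLUSH_INTERVAL = 5.0            # our "5 second rule"
    _dirty = set()                  # objects with unsaved changes
    _dirty_lock = threading.Lock()

    def changed(obj):
        # Called by every setter; instead of writing immediately, mark dirty.
        with _dirty_lock:
            _dirty.add(obj)

    def save_to_db(obj):
        pass                        # stand-in for the real UPDATE statement

    def flusher():
        # Background timer task: periodically push dirty objects to the DB.
        while True:
            time.sleep(FLUSH_INTERVAL)
            with _dirty_lock:
                batch = list(_dirty)
                _dirty.clear()
            for obj in batch:
                save_to_db(obj)

    t = threading.Thread(target=flusher)
    t.daemon = True
    t.start()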
Since the global factory maintained WeakReferences to all in-core objects, and would look for a single instantiated copy of a given game object on any live server, we would never attempt to instantiate a second copy of one game object backed by a single DB record, so the fact that the in-RAM state of the game might differ from the SQL image of it for up to 5 or 10 seconds at a time was inconsequential.
Our entire SQL system ran in RAM (yes, a lot of RAM) as a mirror to another server who tried valiantly to write to disc. (That poor machine burned out RAID drives on average of once every 3-4 months due to “old age.” RAID is good.)
Notably, the objects had to be flushed to database when being removed from cache, e.g. due to exceeding the cache RAM allowance.
3) in-memory database … I hadn't run across this precise situation. We did have “transaction-like” logic, but it all occurred on the level of Java getters/setters.
And, in regards to your latter points:
1) Yes, PostgreSQL and MySQL in particular deal well with this, particularly when you use a RAMdisk mirror of the database to attempt to minimize actual HDD wear and tear. In my experience, MMO's do tend to hammer the database more than is strictly necessary, however. Our “5 second rule”* was built specifically to avoid having to solve the problem “correctly.” Each of our setters would call changed. In our usage pattern, we found that an object typically had either 1 field changed, and then no activity for some time, or else had a “storm” of updates happen, where many fields changed in a row. Building proper transactions or so (e.g. informing the object that it was about to accept many writes, and should wait for a moment before saving itself to the DB) would have involved more planning, logic, and major rewrites of the system; so, instead, we bypassed the situation.
2) Well, there's my design above :-)
In point of fact, the MMO engine I'm presently working on relies even more heavily upon in-RAM SQL databases, and (I hope) will be doing so a bit better. However, that system is being built using an Entity-Component-System model, rather than the OOP model that I described above.
If you are already based on an OOP model, shifting to ECS is a pretty big paradigm shift, and if you can make OOP work for your purposes, it's probably better to stick with what your team already knows.
*- “the 5 second rule” is a colloquial US “folk belief” that after dropping food on the floor, it's still OK to eat it if you pick it up within 5 seconds.
It's difficult to comment on the entire design/datamodel without greater understanding of the software, but it sounds like your application could benefit from an in-memory database.* Backing up such databases to disk is (relatively speaking) a cheap operation. I've found that it is generally faster to:
A) Create an in-memory database, create a table, insert a million** rows into the given table, and then back-up the entire database to disk
than
B) Insert a million** rows into a table in a disk-bound database.
Obviously, single record insertions/updates/deletions also run faster in-memory. I've had success using JavaDB/Apache Derby for in-memory databases.
*Note that the database need not be embedded in your game server.
**A million may not be an ideal size for this example.

MySQL connection, one or many?

I'm writing a script in python which basically queries WMI and updates the information in a mysql database. One of those "write something you need" to learn to program exercises.
In case something breaks in the middle of the script (for example, the remote computer turns off), it's separated out into functions:
Query Some WMI data
Update that to the database
Query Other WMI data
Update that to the database
Is it better to open one mysql connection at the beginning and leave it open or close the connection after each update?
It seems as though one connection would use fewer resources. (Although I'm just learning, so this is a complete guess.) However, opening and closing the connection with each update seems more 'neat': functions would be more standalone, rather than depending on code outside the function.
"However, opening and closing the connection with each update seems more 'neat'. "
It's also a huge amount of overhead -- and there's no actual benefit.
Creating and disposing of connections is relatively expensive. More importantly, what's the actual reason? How does it improve, simplify, clarify?
Generally, most applications have one connection that they use from when they start to when they stop.
I don't think there is a "better" solution. It's too early to think about resources, and since WMI is quite slow (in comparison to an SQL connection), the DB is not an issue.
Just make it work. And then make it better.
The good thing about working with an open connection here is that the "natural" solution is to use objects and not just functions, so it will be a learning experience (in case you are learning Python and not MySQL).
Think for a moment about the following scenario:
for dataItem in dataSet:
    update(dataItem)
If you open and close your connection inside of the update function and your dataSet contains a thousand items then you will destroy the performance of your application and ruin any transactional capabilities.
A better way would be to open a connection and pass it to the update function. You could even have your update function call a connection manager of sorts. If you intend to perform single updates periodically then open and close your connection around your update function calls.
In this way you will be able to use functions to encapsulate your data operations and be able to share a connection between them.
However, this approach is not great for performing bulk inserts or updates.
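For example, a sketch assuming pymysql (the table, columns, and data_set are invented just to show the connection being passed in):

    import pymysql

    def update_wmi_data(conn, item):
        # The function receives the connection instead of creating its own.
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE machines SET cpu_load = %s WHERE host = %s",
                (item['cpu_load'], item['host']),
            )

    data_set = [{'host': 'pc1', 'cpu_load': 0.4}]   # stand-in for WMI results

    conn = pymysql.connect(host='localhost', user='user',
                           password='secret', db='inventory')
    try:
        for item in data_set:
            update_wmi_data(conn, item)
        conn.commit()               # one transaction for the whole batch
    finally:
        conn.close()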
Useful clues in S.Lott's and Igal Serban's answers. I think you should first find out your actual requirements and code accordingly.
Just to mention a different strategy: some applications keep a pool of database (or whatever) connections and, when a transaction is needed, just pull one from that pool. It seems rather obvious that you need just one connection for this kind of application, but you can still keep a pool of one connection and apply the following:
Whenever a database transaction is needed, the connection is pulled from the pool and returned at the end.
(optional) The connection is expired (and replaced by a new one) after a certain amount of time.
(optional) The connection is expired after a certain amount of usage.
(optional) The pool can check (by sending an inexpensive query) whether the connection is alive before handing it over to the program.
This is somewhat in between single connection and connection per transaction strategies.
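A bare-bones version of that pool-of-one, with the liveness check, might look like this (a sketch using pymysql; adapt the expiry rules to taste):

    import pymysql

    class SingleConnectionPool(object):
        """A 'pool' holding exactly one connection, checked before each use."""

        def __init__(self, **connect_kwargs):
            self._kwargs = connect_kwargs
            self._conn = None

        def acquire(self):
            if self._conn is None:
                self._conn = pymysql.connect(**self._kwargs)
            else:
                # Inexpensive liveness check; reconnects if the server dropped us.
                self._conn.ping(reconnect=True)
            return self._conn

        def release(self, conn):
            pass                    # with a single connection we simply keep it

    pool = SingleConnectionPool(host='localhost', user='user',
                                password='secret', db='inventory')
    conn = pool.acquire()
    # ... run queries with conn ...
    pool.release(conn)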
