An efficient way to create one connection with multiple replica databases - Python

Assume there are 10 replica databases, which exist for better reliability of the service. We need to design and implement this so that one connection is almost guaranteed.
Factors that are important:
OOP principles and design patterns
Efficiency and performance
So probably the first approach that comes to mind (which is not good) is using try/except to loop over all of the databases, trying the next one until a healthy connection is created.
This might work, but the goal is maintainability for a large-scale project.
e.g.
import logging

class ConnectDatabase:
    def __init__(self):
        self.configs = ['config1', 'config2']  # 10 configs for example
        # (from dotenv or hard-coded)
        self.logger = logging.getLogger(__name__)
        for conf in self.configs:
            try:
                # Connection is whatever DB driver class is in use
                self.db_connection = Connection(conf)
                break
            except Exception as error:
                self.logger.error(error)
Problems
Bad design
Hard to automate tests
To make future testing maintainable we can move self.configs to constructor arguments, but that creates new problems when constructing objects (dependency injection, maybe?!); see the sketch after this question.
So what do you think? How should we handle this problem at a high level? I guess there should be a way to handle the config outside this main connection class. Any suggestion or particular design pattern?
Thanks
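As a minimal sketch of the dependency-injection idea mentioned in the question (all names here, such as EnvConfigProvider and connection_factory, are illustrative assumptions, not a prescribed API): the connection class no longer knows where its configs come from; it just receives a provider and a factory.

import logging

class EnvConfigProvider:
    """Hypothetical provider; could read from dotenv, a file, etc."""
    def get_configs(self):
        return ['config1', 'config2']  # 10 configs in practice

class ConnectDatabase:
    def __init__(self, config_provider, connection_factory):
        self.logger = logging.getLogger(__name__)
        self.db_connection = None
        for conf in config_provider.get_configs():
            try:
                # The factory is injected too, so tests can pass a fake.
                self.db_connection = connection_factory(conf)
                break
            except Exception as error:
                self.logger.error(error)
        if self.db_connection is None:
            raise ConnectionError("no replica accepted a connection")

In production this might be ConnectDatabase(EnvConfigProvider(), Connection); in tests both arguments can be fakes, which addresses the testability complaint without the class knowing where its configs live.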

Related

How to approach this unit-testing issue based on an external class?

I have a class Node (add_node.py) in which I create nodes that connect to a websocket. Now I want to write unit-tests for checking whether or not the connection was successful, and for checking the response of the server.
So I created a node_tests.py file with the following content:
import unittest
import json
import re

from add_node import Node

class TestNodes(unittest.TestCase):
    def test_node_creation(self):
        self.node = Node(a='1', b='2', c=True)
        self.response = json.loads(self.node.connect())
        self.assertIn('ok', self.response['r'])

    def test_node_c(self):
        self.assertTrue(self.response['c'])

if __name__ == '__main__':
    unittest.main()
The first method is working, but the second is failing because there is no attribute 'response'. So how could I approach this problem?
Also, is it OK to do it the way I'm doing it? Importing the class and writing multiple tests within the same Test class?
The point of a unit test is to verify the functionality of a single isolated unit. Exactly what a unit is can be debated. Some would argue it's a single public method on a single class. I think most would agree that is not a very useful definition, though. I like to think of them as use-cases. Your tests should run one use-case, and then make one assertion about the results of that use-case. This means that sometimes it's OK to let classes collaborate, and sometimes it's better to isolate a class and use test doubles for its dependencies.
Knowing when to isolate is something you learn over time. I think the most important points to consider are that every test you write should
Fully define the context in which it's run (without depending on global state or previously run tests)
Be fast, a few milliseconds tops (this means you don't touch external resources like the file system, a web server or some database)
Not test a bunch of other things that are covered by other tests.
This third point is probably the hardest to balance. Obviously several tests will run the same code if they're making different assertions on the same use-case. You should keep the tests small though. Let's say we want to test the cancellation process of orders in an e-commerce application. In this case we probably don't want to import and set up all the classes used to create, verify, etc. an order before we cancel it. In that case it's probably a better idea to just create an order object manually or maybe use a test double.
In your case, if what you actually want to do is to test the real connection and the responses the real server gives, you don't want a unit test, you want an integration test.
If what you actually want is to test the business logic of your client class, however, you should probably create a fake socket/server for which you can define the responses, and whether or not the connection is successful, yourself. This allows you to test that your client behaves correctly depending on its communication with the remote server, without actually having to depend on the remote server in your test suite.
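For instance, here is a minimal sketch of that idea using unittest.mock, assuming Node exposes a connect() method as in the question and that its constructor does no network I/O; the canned payload is a hypothetical stand-in for whatever the real server returns:

import json
import unittest
from unittest import mock

from add_node import Node

class TestNodeWithFakeServer(unittest.TestCase):
    def test_connect_parses_ok_response(self):
        # Replace the real network call with a controlled, canned response.
        fake_payload = json.dumps({'r': 'ok', 'c': True})
        with mock.patch.object(Node, 'connect', return_value=fake_payload):
            node = Node(a='1', b='2', c=True)
            response = json.loads(node.connect())
        self.assertIn('ok', response['r'])
        self.assertTrue(response['c'])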

How do you test the consistency models and race conditions on Google App Engine / NDB?

Setup: Python, NDB, the GAE datastore. I'm trying to make sure I understand some of the constraints around my data model and its tradeoffs for consistency and max write rate. Is there any way to test/benchmark this on staging or on my dev box, or should I just bite the bullet, push to prod on a shadow site of sorts, and write a bunch of scripts?
You can use PseudoRandomHRConsistencyPolicy to control consistency in your tests. However, as far as I know, there is no way to test the max write rate.
import unittest

from google.appengine.ext import testbed, ndb
from google.appengine.datastore import datastore_stub_util

class Foo(ndb.Model):
    pass

class TestConsistency(unittest.TestCase):
    def setUp(self):
        self.testbed = testbed.Testbed()
        self.testbed.activate()

    def tearDown(self):
        self.testbed.deactivate()

    def test_consistency(self):
        self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(
            probability=1)
        self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
        foo = Foo()
        foo.put()
        self.assertEqual(Foo.query().count(), 1)

    def test_consistency_failed(self):
        self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(
            probability=0)
        self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
        foo = Foo()
        foo.put()
        self.assertEqual(Foo.query().count(), 0)
You really need to do testing in the real environment. At best, the dev environment is an approximation of production. You certainly can't draw any conclusions at all about performance by just using the SDK. In many cases the SDK is faster (startup times) and slower (queries on large datasets). Eventual consistency is emulated and not 100% the same as in production.
I am not sure it can be tested. The inconsistencies are inconsistent. I think you just have to know that datastore operations have inconsistencies, and code around them. You don't want to plan on observations from your tests being dependable in the future.
I am answering this over a year after it was asked.
The only way to test these sorts of things is by deploying an app on GAE. What I sometimes do when I run across these challenges is "whip up" a quick application tailor-made to test just the scenario under consideration. And then, as you put it, you have to 'script' the doing of stuff using some combination of tasks, cron, and client-side curl-type operations.
The particular tradeoff in the original question is write throughput versus consistency. This is actually pretty straightforward once you get the hang of it. A strongly consistent query requires that the entities are in the same entity group. And, at the same time, there is the constraint that a given entity group may only have approximately 1 write per second.
So, you have to look at your needs / usage pattern to figure out if you can use an entity group.
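As an illustrative sketch (the model and key names here are made up): keying entities under a common parent puts them in one entity group, which makes ancestor queries over them strongly consistent at the cost of the per-group write limit.

from google.appengine.ext import ndb

class Order(ndb.Model):
    pass

# Keying all of a customer's orders under one parent puts them in a
# single entity group.
customer_key = ndb.Key('Customer', 'alice')
Order(parent=customer_key).put()

# An ancestor query over that group is strongly consistent, but the
# group as a whole is limited to roughly one write per second.
orders = Order.query(ancestor=customer_key).fetch()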

Need suggestion about MMORPG data model design, database access and stackless python [closed]

I'm engaged in developing a turn-based casual MMORPG game server.
The low-level engine (NOT written by us), which handles networking, multi-threading, timers, inter-server communication, the main game loop, etc., was written in C++. The high-level game logic was written in Python.
My question is about the data model design in our game.
At first we simply tried to load all of a player's data into RAM and a shared data cache server when the client logs in, and scheduled a timer to periodically flush data to the data cache server, which would persist the data into the database.
But we found this approach has some problems:
1) Some data needs to be saved or checked instantly, such as quest progress, level-ups, item & money gains, etc.
2) According to the game logic, we sometimes need to query some offline player's data.
3) Some global game world data needs to be shared between different game instances, which may be running on a different host or in a different process on the same host. This is the main reason we need a data cache server sitting between the game logic server and the database.
4) Players need to switch freely between game instances.
Below are the difficulties we encountered in the past:
1) All data access operations should be asynchronous to avoid network I/O blocking the main game logic thread. We have to send a message to the database or cache server, handle the data reply message in a callback function, and then continue the game logic. It quickly becomes painful to write moderately complex game logic that needs to talk to the db several times, and the game logic being scattered across many callback functions makes it hard to understand and maintain.
2) The ad-hoc data cache server makes things more complex; it is hard for us to maintain data consistency and to update/load/refresh data effectively.
3) In-game data queries are inefficient and cumbersome; the game logic needs to query a lot of information such as inventory, item info, avatar state, etc. Some transaction mechanism is also needed; for example, if one step fails the entire operation should be rolled back. We tried to design a good data model system in RAM, building a lot of complex indexes to ease the numerous information queries, adding transaction support, etc. I quickly realized that what we were building was an in-memory database system; we were reinventing the wheel...
Finally I turned to Stackless Python and we removed the cache server. All data is saved in the database, and the game logic server queries the database directly. With Stackless Python's micro-tasklets and channels, we can write game logic in a synchronous style. It is far easier to write and understand, and productivity has greatly improved.
In fact, the underlying DB access is still asynchronous: a client tasklet issues a request to a dedicated DB I/O worker thread and blocks on a channel, but the main game logic as a whole is not blocked; other clients' tasklets are scheduled and run freely. When the DB data reply arrives, the blocked tasklet is woken up and continues to run from the 'break point' (a continuation?). A sketch of this pattern follows below.
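A minimal sketch of that tasklet/channel pattern, assuming Stackless Python; for brevity the DB worker is shown as another tasklet rather than a real worker thread, and the query string is a placeholder:

import stackless

def db_worker(requests):
    # Dedicated worker: services one query at a time.
    while True:
        query, reply = requests.receive()
        result = "rows for %r" % (query,)  # stand-in for real, blocking DB I/O
        reply.send(result)

def game_logic(requests):
    reply = stackless.channel()
    requests.send(("SELECT * FROM inventory", reply))
    rows = reply.receive()  # only this tasklet blocks; others keep running
    print("got", rows)

requests = stackless.channel()
stackless.tasklet(db_worker)(requests)
stackless.tasklet(game_logic)(requests)
stackless.run()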
With the above design, I have some questions:
1) DB access will be more frequent than in the previous cached solution; can the DB support such frequent query/update operations? Will some mature cache solution such as Redis or memcached be needed in the near future?
2) Are there any serious pitfalls in my design? Can you give me some better suggestions, especially on in-game data management patterns?
Any suggestion would be appreciated, thanks.
I've worked with one MMO engine that operated in a somewhat similar fashion. It was written in Java, however, not Python.
With regards to your first set of points:
1) async db access We actually went the other route, and avoided having a “main game logic thread.” All game logic tasks were spawned as new threads. The overhead of thread creation and destruction was completely lost in the noise floor compared to I/O. This also preserved the semantics of having each “task” as a reasonably straightforward method, instead of the maddening chain of callbacks that one otherwise ends up with (although there were still cases of this.) It also meant that all game code had to be concurrent, and we grew increasingly reliant upon immutable data objects with timestamps.
2) ad-hoc cache We employed a lot of WeakReference objects (I believe Python has a similar concept?), and also made use of a split between the data objects, e.g. "Player", and the "loader" (actually database access methods), e.g. "PlayerSQLLoader"; the instances kept a pointer to their Loader, and the Loaders were called by a global "factory" class that would handle cache lookups versus network or SQL loads. Every "setter" method in a data class would call the method changed, which was inherited boilerplate for myLoader.changed(this);
In order to handle loading objects from other active servers, we employed “proxy” objects that used the same data class (again, say, “Player,”) but the Loader class we associated was a network proxy that would (synchronously, but over gigabit local network) update the “master” copy of that object on another server; in turn, the “master” copy would call changed itself.
Our SQL UPDATE logic had a timer. If the backend database had received an UPDATE of the object within the last ($n) seconds (we typically kept this around 5), it would instead add the object to a “dirty list.” A background timer task would periodically wake and attempt to flush any objects still on the “dirty list” to the database backend asynchronously.
Since the global factory maintained WeakReferences to all in-core objects, and would look for a single instantiated copy of a given game object on any live server, we would never attempt to instantiate a second copy of one game object backed by a single DB record, so the fact that the in-RAM state of the game might differ from the SQL image of it for up to 5 or 10 seconds at a time was inconsequential.
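Translated to Python as a rough sketch (the original was Java; flush_to_db, the data structures, and the 5-second threshold are all stand-ins), the write-behind "dirty list" looks something like this:

import threading
import time

WRITE_INTERVAL = 5.0  # the "5 second rule" described above

def flush_to_db(obj):
    print("UPDATE ... for %r" % (obj,))  # placeholder for the real SQL UPDATE

class WriteBehindCache:
    def __init__(self):
        self.last_write = {}  # object id -> time of last direct UPDATE
        self.dirty = {}       # object id -> object awaiting a deferred flush
        self.lock = threading.Lock()

    def changed(self, obj):
        # Called by every setter, as described above.
        now = time.monotonic()
        with self.lock:
            if now - self.last_write.get(id(obj), 0) < WRITE_INTERVAL:
                self.dirty[id(obj)] = obj  # written recently: defer
            else:
                self.last_write[id(obj)] = now
                flush_to_db(obj)           # quiet object: write through

    def flush_dirty(self):
        # Run periodically by a background timer task.
        with self.lock:
            pending, self.dirty = list(self.dirty.values()), {}
        for obj in pending:
            flush_to_db(obj)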
Our entire SQL system ran in RAM (yes, a lot of RAM) as a mirror to another server that tried valiantly to write to disc. (That poor machine burned out RAID drives on average once every 3-4 months due to "old age." RAID is good.)
Notably, the objects had to be flushed to database when being removed from cache, e.g. due to exceeding the cache RAM allowance.
3) in-memory database … I hadn't run across this precise situation. We did have “transaction-like” logic, but it all occurred on the level of Java getters/setters.
And, in regards to your latter points:
1) Yes, PostgreSQL and MySQL in particular deal well with this, particularly when you use a RAMdisk mirror of the database to minimize actual HDD wear and tear. In my experience, MMOs do tend to hammer the database more than is strictly necessary, however. Our "5 second rule"* was built specifically to avoid having to solve the problem "correctly." Each of our setters would call changed. In our usage pattern, we found that an object typically either had one field changed and then no activity for some time, or else had a "storm" of updates, where many fields changed in a row. Building proper transactions (e.g. informing the object that it was about to accept many writes, and should wait for a moment before saving itself to the DB) would have involved more planning, logic, and major rewrites of the system; so, instead, we bypassed the situation.
2) Well, there's my design above :-)
In point of fact, the MMO engine I'm presently working on relies even more heavily upon in-RAM SQL databases, and (I hope) will be doing so a bit better. However, that system is being built using an Entity-Component-System model, rather than the OOP model that I described above.
If you are already based on an OOP model, shifting to ECS is a big paradigm shift, and if you can make OOP work for your purposes, it's probably better to stick with what your team already knows.
*- “the 5 second rule” is a colloquial US “folk belief” that after dropping food on the floor, it's still OK to eat it if you pick it up within 5 seconds.
It's difficult to comment on the entire design/data model without a greater understanding of the software, but it sounds like your application could benefit from an in-memory database.* Backing up such databases to disk is (relatively speaking) a cheap operation. I've found that it is generally faster to:
A) Create an in-memory database, create a table, insert a million** rows into the given table, and then back-up the entire database to disk
than
B) Insert a million** rows into a table in a disk-bound database.
Obviously, single record insertions/updates/deletions also run faster in-memory. I've had success using JavaDB/Apache Derby for in-memory databases.
*Note that the database need not be embedded in your game server.
**A million may not be an ideal size for this example.
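In Python terms (the answer above used JavaDB/Apache Derby), the same build-in-memory-then-back-up approach can be sketched with sqlite3; the table, row count, and file name are arbitrary:

import sqlite3

# Build everything in RAM first...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
mem.executemany("INSERT INTO scores VALUES (?, ?)",
                ((f"p{i}", i) for i in range(1_000_000)))
mem.commit()

# ...then back the whole database up to disk in one operation.
disk = sqlite3.connect("backup.db")
mem.backup(disk)
disk.close()
mem.close()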

Pymongo, connection pooling and asynchronous tasks via Celery

I'm using pymongo to access mongodb in an application that also uses Celery to perform many asynchronous tasks. I know pymongo's connection pooling does not support asynchronous workers (based on the docs).
To access collections I've got a Collection class wrapping certain logic that fits my application. I'm trying to make sense of some code that I inherited with this wrapper:
Each collection at the moment creates its own Connection instance. Based on what I'm reading, this is wrong: I should really have a single Connection instance (in settings.py or such) and import it into my Collection instances. That bit is clear. Is there a guideline on the maximum number of connections recommended? The current code surely creates a LOT of connections/sockets, as it's not really making use of the pooling facilities.
However, as some code is called from asynchronous Celery tasks as well as being run synchronously, I'm not sure how to handle this. My thought is to instantiate new Connection instances for the tasks and use the single one for the synchronous ones (calling end_request, of course, after each activity is done). Is this the right direction?
Thanks!
Harel
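As a hedged sketch of the "single shared instance" idea from the question (the module and database names are made up, and modern pymongo calls the class MongoClient rather than Connection):

# settings.py (hypothetical module)
from pymongo import MongoClient

# One client per process; pymongo pools connections internally.
client = MongoClient("mongodb://localhost:27017", maxPoolSize=50)
db = client["myapp"]

# elsewhere in the application:
#   from settings import db
#   db.my_collection.insert_one({"status": "ok"})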
From pymongo's docs: "PyMongo is thread-safe and even provides built-in connection pooling for threaded applications."
The word "asynchronous" in your situation can be translated into how "inconsistent" your application's requirements are.
Statements like "x += 1" will never be consistent in your app. If you can afford this, there is no problem. If you have "critical" operations, you must implement some kind of locking for synchronization.
As for the maximum number of connections, I don't know exact numbers, so test and proceed.
Also take a look at Redis and this example, if speed and memory efficiency are required. From some benchmarks I made, the Redis Python driver is at least 2x faster than pymongo for reads/writes.

How to correctly achieve test isolation with a stateful Python module?

The project I'm working on is business logic software wrapped up as a Python package. The idea is that various scripts or applications will import it, initialize it, then use it.
It currently has a top-level init() method that does the initialization and sets up various things; a good example is that it sets up SQLAlchemy with a db connection and stores the SA session for later access. It is stored in a subpackage of my project (namely myproj.model.Session, so other code can get a working SA session after importing the model).
Long story short, this makes my package a stateful one. I'm writing unit tests for the project and this stateful behaviour poses some problems:
tests should be isolated, but the internal state of my package breaks this isolation
I cannot test the main init() method since its behavior depends on the state
future tests will need to be run against the (not yet written) controller part with a well-known model state (e.g. a pre-populated SQLite in-memory db)
Should I somehow refactor my package because the current structure is not the Best (possible) Practice(tm)? :)
Should I leave it at that and set up/tear down the whole thing every time? If I'm going to achieve complete isolation, that'd mean fully erasing and re-populating the db for every single test; isn't that overkill?
This question is really about the overall code & test structure, but for what it's worth I'm using nose-1.0 for my tests. I know the Isolate plugin could probably help me, but I'd like to get the code right before doing strange things in the test suite.
You have a few options:
Mock the database
There are a few trade-offs to be aware of.
Your tests will become more complex, as you will have to do the setup, teardown and mocking of the connection. You may also want to verify the SQL/commands sent. It also tends to create an odd sort of tight coupling, which may cause you to spend additional time maintaining/updating tests when the schema or SQL changes.
This is usually the purest form of test isolation because it removes a potentially large dependency from testing. It also tends to make tests faster and reduces the overhead of automating the test suite in, say, a continuous integration environment.
Recreate the DB with each Test
Trade-offs to be aware of.
This can make your tests very slow, depending on how much time it actually takes to recreate your database. If the dev database server is a shared resource, there will have to be additional initial investment in making sure each dev has their own db on the server. The server may become impacted depending on how often tests get run. There is additional overhead to running your test suite in a continuous integration environment because it will need at least one db, possibly more (depending on how many branches are being built simultaneously).
The benefit has to do with actually running through the same code paths and similar resources that will be used in production. This usually helps to reveal bugs earlier, which is always a very good thing.
ORM DB swap
If you're using an ORM like SQLAlchemy, there is a possibility that you can swap the underlying database for a potentially faster in-memory database. This allows you to mitigate some of the negatives of both previous options.
It's not quite the same database as will be used in production, but the ORM should help mitigate the risk that the difference obscures a bug. Typically the time to set up an in-memory database is much shorter than for one that is file-backed. It also has the benefit of being isolated to the current test run, so you don't have to worry about shared resource management or final teardown/cleanup.
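A minimal sketch of that swap with SQLAlchemy, assuming the package exposes a declarative Base (the import path is illustrative, modeled on the asker's myproj.model):

import unittest

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from myproj.model import Base  # hypothetical: the package's declarative base

class InMemoryDBTestCase(unittest.TestCase):
    def setUp(self):
        # Swap the real database for a throwaway in-memory SQLite one.
        self.engine = create_engine("sqlite:///:memory:")
        Base.metadata.create_all(self.engine)
        self.session = sessionmaker(bind=self.engine)()

    def tearDown(self):
        self.session.close()
        self.engine.dispose()  # the whole db vanishes with the engine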
Working on a project with a relatively expensive setup (IPython), I've seen an approach where we call a get_ipython function that sets up and returns an instance while replacing itself with a function that returns a reference to the existing instance. Then every test can call the same function, but it only does the setup for the first one.
That saves doing a long setup procedure for every test, but occasionally it creates odd cases where a test fails or passes depending on what tests were run before. We have ways of dealing with that - a lot of the tests should do the same thing regardless of the state, and we can try to reset the object's state before certain tests. You might find a similar trade-off works for you.
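In plain Python, that self-replacing setup function looks roughly like this (ExpensiveThing is a hypothetical stand-in for the costly object):

class ExpensiveThing:
    """Hypothetical stand-in for something costly to construct."""

def get_instance():
    instance = ExpensiveThing()  # expensive one-time setup
    global get_instance
    # Replace ourselves with a cheap accessor for the same instance.
    get_instance = lambda: instance
    return instance

# Every test calls get_instance(); only the first call pays the setup cost.
assert get_instance() is get_instance()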
Mock is a simple and powerful tool for achieving some isolation. There is a nice video from PyCon 2011 which shows how to use it. I recommend using it together with py.test, which reduces the amount of code required to define tests and is still very, very powerful.
