I have a MongoDB cluster (2.6.3) with three mongos processes and two replica sets, with no sharding enabled.
In particular, I have 7 hosts (all are Ubuntu Server 14.04):
host1: mongos + client application
host2: mongos + client application
host3: mongos + client application
host4: RS1_Primary (or RS1_Secondary) and RS2_Arbiter
host5: RS1_Secondary (or RS1_Primary)
host6: RS2_Primary (or RS2_Secondary) and RS1_Arbiter
host7: RS2_Secondary (or RS2_Primary)
The client application here is a Zato cluster with 4 gunicorn workers running on each server; each worker accesses MongoDB through two pymongo.MongoClient instances.
These MongoClient objects are created as follows:
MongoClient(mongo_hosts, read_preference=ReadPreference.SECONDARY_PREFERRED, w=0, max_pool_size=25)
MongoClient(mongo_hosts, read_preference=ReadPreference.SECONDARY_PREFERRED, w=0, max_pool_size=10)
where mongo_hosts is 'host1:27017,host2:27017,host3:27017' on all servers.
So, in total, I have 12 MongoClient instances with max_pool_size=25 (4 on each of the 3 client servers) and 12 others with max_pool_size=10 (also 4 on each server).
And my problem is:
When the Zato clusters are started and begin receiving requests (up to 10 req/sec each, balanced with a simple round robin), a bunch of new connections are created, and around 15-20 of them are then kept permanently open over time on each mongos.
However, at some random point and with no apparent cause, a couple of connections are suddenly dropped at the same time on all three mongos, and then the total number of connections keeps changing randomly until it stabilizes again after a few minutes (5 to 10).
And while this happens, even though I see no slow queries in the MongoDB logs (neither in mongos nor in mongod), the performance of the platform is severely reduced.
I have been isolating the problem and already tried to:
change the connection string to 'localhost:27017' in each MongoClient to see if the problem was in only one of the clients. The problem persisted, and it kept affecting the three mongos at the same time, so it looks like something on the server side.
add log traces to make sure the time is really being lost inside MongoClient. The result is that a simple find query run through MongoClient clearly takes more than one second on the client side, while it usually takes less than 10 ms. However, as I said before, I see no slow queries at all in the MongoDB logs (default profiling level: 100 ms). A sketch of this kind of client-side timing trace follows this list.
monitor the platform activity to see if there's a load increase when this happens. There's none, and indeed it can even happen during low load periods.
monitor other variables in the servers, such as cpu usage or disk activity. I found nothing suspicious at all.
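For reference, this is roughly the kind of client-side timing trace mentioned in the second point above (a minimal sketch; collection and the query are placeholders for whatever the workers actually run):

    import logging
    import time

    start = time.time()
    doc = collection.find_one({'status': 'active'})  # placeholder query
    elapsed_ms = (time.time() - start) * 1000
    if elapsed_ms > 100:  # same threshold as the default profiling level
        logging.warning('client-side find took %.1f ms', elapsed_ms)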
So, the questions at the end are:
Has anyone seen something similar (connections being dropped in PyMongo)?
What else can I look at to debug the problem?
Possible solution: MongoClient allows you to define a max_pool_size, but I haven't found any reference to a min_pool_size. Is it possible to define one? Perhaps making the number of connections static would fix my performance problems.
Note about the MongoDB version: I am currently running MongoDB 2.6.3, but I already had this problem before upgrading from 2.6.1, so it's nothing introduced in the latest version.
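Regarding the min_pool_size idea: PyMongo 2.x has no such option as far as I know, but newer PyMongo releases (3.x and later) do accept a minPoolSize setting that keeps a minimum number of sockets open per server. A minimal sketch, assuming an upgrade to a PyMongo version that supports it and reusing the same mongo_hosts string as above:

    from pymongo import MongoClient

    client = MongoClient(
        mongo_hosts,
        readPreference='secondaryPreferred',
        w=0,
        maxPoolSize=25,
        minPoolSize=10,  # keep at least 10 sockets open so the pool does not shrink when idle
    )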
I am facing this weird issue. Some (5%) of my celery tasks are silently being dropped.
Doing some digging in the Celery logs, I found that in some cases the same task ID gets generated for different tasks. Naturally, any new task overwrites an existing task with the same task ID, causing the old task to be silently dropped (if it hadn't been executed yet).
In a span of 1.5 hours, the same UUID was generated 3 times. I did some random sampling, and this turned out to be the case on the same machine within a short span (1-2 hours). The server generates around 1 million UUIDs a day, a minuscule 7-digit number compared to the 38-digit number of possible UUIDs.
I am running Python 3.6 and Celery 4.4.2 on a Linux VM.
Celery uses Python's uuid.uuid4 (reference).
I'm not sure how to proceed from here. Is there a bug in some version of Python (or the Linux kernel), some configuration issue, or a hardware/VM bug? All of these scenarios seem very unlikely.
Update:
The VM is a standard Google Cloud Platform compute instance running Ubuntu 18.04 LTS.
I couldn't figure out why but I implemented a workaround.
I monkey-patched uuid.uuid4. For some reason I was unable to do the same for celery.utils.uuid or kombu.utils.uuid.
I made a very simple generator that concatenates the system monotonic clock and the hostname and builds a UUID from them:
import socket
import time
import uuid

def __my_uuid_generator():
    time_hex = float.hex(time.monotonic())[4:-4]              # ~13 chars from the monotonic clock
    host = hex(abs(hash(socket.gethostname())))[2:]           # ~16 chars derived from the hostname
    hashed = bytes(f'{time_hex}{host}', 'ascii').hex()[:32]   # always a 32-char hex string
    return uuid.UUID(hashed)

# Monkey patch uuid4, because https://stackoverflow.com/q/62312607/1396264. Sigh!
uuid.uuid4 = __my_uuid_generator
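An alternative that avoids patching the standard library is to generate the IDs yourself and pass them to Celery, since apply_async accepts a caller-supplied task_id. A minimal sketch, where my_task and make_unique_id() are hypothetical placeholders for your task and your own collision-free ID scheme:

    # sketch: supply our own task IDs instead of patching uuid.uuid4
    task_id = make_unique_id()  # hypothetical helper implementing your ID scheme
    my_task.apply_async(args=(payload,), task_id=task_id)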
We have built a series of programs that process data. In the last step, they write out a series of Neo4j commands to create nodes, and later, to connect those nodes. This is a sample of what is created.
CREATE (n1:ActionElement {
nodekey:1,
name:'type',
resource: 'action',
element:'ActionElement',
description:'CompilationUnit',
linestart:1,
colstart:7,
lineend:454,
colend:70,
content:[],
level:1,
end:false
});
The issue is that the generated file has ~20,000 lines. When I run it through the shell, I get an error on some of the transactions. It seems to alternately process and reject. I can't see a pattern, but I am assuming that I am overrunning the processing speed.
neo4j> CREATE (n1573)-[:sibling]->(n1572);
Connection refused
neo4j> CREATE (n1574)-[:sibling]->(n1573);
Connection refused
neo4j> CREATE (n1575)-[:sibling]->(n1574);
0 rows available after 3361 ms, consumed after another 2 ms
Added 2 nodes, Created 1 relationships
neo4j> CREATE (n1579)-[:sibling]->(n1578);
0 rows available after 78 ms, consumed after another 0 ms
Interestingly enough, it recovers, fails, then recovers again.
Any thoughts? Is this just fundamentally the wrong way to do this? The last program to touch the data happens to be Python; should I have it update the database directly? Thank you
In the end, it was a network issue combined with a transaction issue. To solve it, we FTP'd the file to the same server as the Neo4j instance (eliminating the network latency) and then modified the Python loader to wrap each statement in a try/except: if it failed, wait 5 seconds and retry, and only report an error if the second try also failed. That eliminated any 'transaction latency'. On a 16-core machine it did not fail at all. On a single core it retried 4 times across 20,000 updates, and all of them passed on the second try. Not ideal, but workable.
Thanks
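For reference, a minimal sketch of that retry approach, assuming the official neo4j Python driver; the URI, credentials and statements.txt file name are placeholders:

    import time
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

    def run_with_retry(statement):
        try:
            with driver.session() as session:
                session.run(statement).consume()
        except Exception:
            time.sleep(5)  # wait 5 seconds and retry once, as described above
            with driver.session() as session:
                session.run(statement).consume()  # let the second failure propagate

    with open('statements.txt') as f:
        for line in f:
            if line.strip():
                run_with_retry(line.strip())

    driver.close()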
We have recently been having issues with our prod servers connecting to Oracle. Intermittently we get "DatabaseError: ORA-12502: TNS:listener received no CONNECT_DATA from client". The issue is completely random, goes away by itself within a second, and it is not a Django problem: we can replicate it with SQL*Plus from the servers.
We opened a ticket with Oracle support, but in the meantime I'm wondering if it's possible to simply retry any DB-related operation when it fails.
The problem is that I can't use try/except blocks in the code to handle this, since it can happen on ANY DB interaction in the entire codebase. I have to do this at a lower level, so that I do it only once. Is there any way to install an error handler or something like that directly at the django.db.backends.oracle level so that it covers the whole codebase? Basically, all I want to do is this:
try:
    execute_sql()
except DatabaseError as e:
    if 'ORA-12502' in str(e):
        time.sleep(1)   # wait one second
        execute_sql()   # re-try the same operation
Is this even possible, or am I out of luck?
Thanks!
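One option, if upgrading is possible: Django 2.0 and later provide connection.execute_wrapper(), which wraps every query that goes through that connection, so the retry only has to be written once. A minimal sketch under that assumption (in older versions you would have to wrap the cursor in a custom database backend instead); do_database_work() is a placeholder for the wrapped code path:

    import time

    from django.db import connection
    from django.db.utils import DatabaseError

    def retry_ora_12502(execute, sql, params, many, context):
        try:
            return execute(sql, params, many, context)
        except DatabaseError as e:
            if 'ORA-12502' in str(e):
                time.sleep(1)  # wait a second and retry the same statement once
                return execute(sql, params, many, context)
            raise

    # e.g. in a middleware, so it covers every query made while handling a request:
    with connection.execute_wrapper(retry_ora_12502):
        do_database_work()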
I have a small Python script that connects to a Microsoft SQL Server database, gets users from there, and then syncs them to another MySQL database. Basically I'm just running queries to check whether each user exists and, if not, adding that user to the MySQL database.
The script usually takes around 1 min to run a sync. I need the script to do its work every 5 mins (for example), exactly once per interval (one sync per 5 mins).
What would be the best way to go about building this?
I have some test data for the users, but on the real site there are a lot more users, so I can't guarantee the script takes 1 min to execute; it might even take 20 mins. However, having a fixed interval of, say, 15 mins between executions would be ideal for the problem...
Update:
I have the connection params for the SQL Server Windows DB, so I'm using a small Ubuntu server to sync between the two databases, which live on different servers. So let's say db1 (Windows) and db2 (Linux) are the database servers; I'm using s1 (the Python server) with the pymssql and MySQL Python modules to do the sync.
Regards
I am not sure cron is right for the job. It seems to me that if you have it run every 15 minutes but a sync sometimes takes 20 minutes, you could have multiple processes running at once and possibly colliding.
If the driving force is a constant wait time between the variable execution times, then you might need a continuously running process with a wait.
import time

def main():
    loopInt = 0
    while loopInt < 10000:
        synchDatabase()   # placeholder for your sync routine
        loopInt += 1
        print("call #" + str(loopInt))
        time.sleep(300)   # sleep 5 minutes

main()
(Obviously this is not continuous, just long-running.) You can change the while condition to True to make it continuous (and comment out loopInt += 1).
Edited to add: please see the note in the comments about monitoring the process, as you don't want the script to hang or crash without you being aware of it.
You might want to use a system that handles queues, for example RabbitMQ, and use Celery as the Python interface to implement it. With Celery, you can add tasks (like the execution of a script) to a queue, or run a schedule that performs a task after a given interval (just like cron).
Get started: http://celery.readthedocs.org/en/latest/
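For example, Celery's beat scheduler can run the sync as a periodic task; because each run is queued and picked up by a worker, a long sync does not stall the schedule, although you may still want a lock if overlapping syncs are a problem. A minimal sketch, assuming a RabbitMQ broker on localhost and a hypothetical sync_users() function containing your existing pymssql/MySQL logic:

    from celery import Celery

    app = Celery('usersync', broker='amqp://localhost')

    app.conf.beat_schedule = {
        'sync-users-every-5-minutes': {
            'task': 'usersync.sync_users',
            'schedule': 300.0,  # seconds between runs
        },
    }

    @app.task(name='usersync.sync_users')
    def sync_users():
        # placeholder: run the SQL Server -> MySQL comparison and inserts here
        pass

You would then start a worker with the beat scheduler enabled, e.g. celery -A usersync worker --beat.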
I have a one-year-old production site configured with the django.contrib.sessions.backends.cached_db session backend on top of a MySQL database backend. The reason I chose cached_db is the mix of security and read performance.
The problem is that the cleanup command, responsible for deleting all expired sessions, was never executed, resulting in a session table with 2.3 GB of data, 6 million rows and 500 MB of index.
When I try to run ./manage.py cleanup (in Django 1.3) or ./manage.py clearsessions (its Django 1.5 equivalent), the process never ends (or at least my patience runs out after 3 hours).
The code that Django uses to do this is:
Session.objects.filter(expire_date__lt=timezone.now()).delete()
At first I thought that was normal because the table has 6M rows, but after inspecting the system monitor I discovered that all the memory and CPU were being used by the python process, not by mysqld, exhausting my machine's resources. I think there is something terribly wrong with this command's code. It seems that Python iterates over all the expired session rows it finds before deleting each of them, one by one. In that case, refactoring the code to issue a raw DELETE FROM statement would resolve my problem and help the Django community, right? But if that is the case, the QuerySet delete command is acting weird and is not optimized, in my opinion. Am I right?
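For what it's worth, QuerySet.delete() first fetches the objects so it can handle cascades and signals, which is why the Python process ends up doing the work; a raw DELETE avoids that. A minimal sketch of that workaround, assuming the default django_session table:

    from django.db import connection
    from django.utils import timezone

    cursor = connection.cursor()
    cursor.execute(
        "DELETE FROM django_session WHERE expire_date < %s",
        [timezone.now()],
    )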