shared objects after os.fork()

I've encountered some strange application behaviour while interacting with database using many processes. I'm using Linux.
I have my own implementation of QueryExecutor, which uses a single connection during its lifetime:
class QueryExecutor(object):
    def __init__(self, db_conf):
        self._db_config = db_conf
        self._conn = self._get_connection()
    def execute_query(self, query):
        # some code
        # some more code
def query_executor():
    global _QUERY_EXECUTOR
    if _QUERY_EXECUTOR is None:
        _QUERY_EXECUTOR = QueryExecutor(some_db_config)
    return _QUERY_EXECUTOR
Query Executor is never modified after instantiation.
Initially there is only one process, which from time to time forks (os.fork()) several times. The new processes are workers which do some tasks and then exit. Each worker calls query_executor() to be able to execute a SQL query.
I have found that SQL queries often return wrong results (it seems that a query result is sometimes returned to the wrong process). The only sensible explanation is that all processes share the same SQL connection (according to the MySQLdb docs: threadsafety = 1, "Threads may share the module, but not connections").
I wonder which OS mechanism leads to this situation. As far as I know, when a process forks on Linux, the parent process's pages are not copied for the child; they are shared by both processes until one of them tries to modify a page (copy-on-write). As I mentioned before, the QueryExecutor object remains unmodified after creation. I guess this is why all processes use the same QueryExecutor instance and hence the same SQL connection.
Am I right, or am I missing something? Do you have any suggestions?
Thanks in advance!
Grzegorz

The root of the problem is that fork() creates an independent copy of the process, but the two processes still share open files, sockets and pipes. Data written by the MySQL server can therefore be read [correctly] by only one process; if two processes both send requests and read responses over the same connection, they will quite likely corrupt each other's work. This has nothing to do with multithreading: with threads there is a single process with several threads of execution, which share data and can coordinate.
The correct way to use fork() is to close (or re-open), right after forking, all file-handle-like objects in all but one copy of the process, or at least to avoid using them from more than one process.
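The sharing described above can be observed without a database. The following is a minimal sketch, assuming a POSIX system where os.fork() is available; the function name shared_offset_demo is made up for illustration. A child reads five bytes from an inherited descriptor, and the parent then finds the read position already advanced, which is the same effect that lets one process consume a server response meant for another.

```python
import os
import tempfile


def shared_offset_demo():
    """Return what the parent reads after a forked child has read 5 bytes."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)  # rewind before forking
        pid = os.fork()
        if pid == 0:
            # Child: consume the first five bytes through the inherited fd.
            os.read(fd, 5)
            os._exit(0)  # leave immediately, skipping the parent's cleanup
        os.waitpid(pid, 0)
        # Parent: the file offset is shared, so this read starts at byte 5.
        return os.read(fd, 5)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_offset_demo())
```

With a socket instead of a regular file, the bytes the child consumed would be part of a reply the parent was waiting for.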

Related

SSL error: bad record mac - Managing multiple connections with PostgreSQL and Python multiprocessing

I recently switched some of the background functions of a piece of software I am working on from the Python threading module to the multiprocessing module in order to take advantage of more CPU cores. Most of the transition went without a hitch, but the database integration has been giving me significant issues.
Originally, I was using a single SimpleConnectionPool object from Python psycopg2 which sat as a global variable in a module called db that also handles some boilerplate database operations. From my understanding, creating a second Python process merely copies the current memory stack to the new process location. Because this causes issues with the database connections, I added a function called init() to my db module which simply re-initializes the SimpleConnectionPool and sets it to the global variable. My thinking was that if I called this init function from within a second process, it would create a new set of connections for only the pool on the secondary process. The main process would, therefore, maintain its own set of connections, separate from the second process.
However, using this method I was frequently getting the following exception:
OperationalError: SSL error: decryption failed or bad record mac
This originated directly from "state = conn.poll()" in psycopg2_patcher.py. I did a little digging and from what I can tell, the error is only thrown if both the main process and secondary process attempt to execute a query at the same time. I was thinking of just reverting back to one connection pool in the main process and using Queues to communicate queries from the secondary process to the main process for execution. This comes with a lot of headache though that I would rather avoid.
I also tried moving away from connection pools on the secondary process and used a single connection that is only established when a query needs to be executed, and then closed directly after. The same error occurred when the main process was trying to execute a query around the same time.
What do I need to do to the PGSQL server or my implementation to allow different processes to post queries simultaneously with the same credentials? I get the feeling I am going about the database connections between processes in a wholly unnecessary and convoluted way.

I have an elegant answer to this problem in the question linked below: you just specify the connection count, and all of the async connections are handled for you. It is a modified version of the ThreadedConnectionPool.
Python Postgres psycopg2 ThreadedConnectionPool exhausted

memcache.get returns wrong object (Celery, Django)

Here is what we have currently:
we're trying to get a cached django model instance; the cache key includes the name of the model and the instance id. Django's standard memcached backend is used. This procedure is part of a common procedure used very widely, not only in celery.
sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance; even a same-model-different-id case has appeared. We catch this by checking that the model name & id correspond to the cache key.
the bug appears only in the context of three of our celery tasks, and never reproduces in the python shell or in other celery tasks. UPD: it appears under long-running CPU-RAM intensive tasks only
the cache stores the correct value (we checked that manually at the moment the bug appeared)
calling the same task again with the same arguments might not reproduce the issue, although the probability is much higher, so bug appearances tend to "group" in the same period of time
restarting celery solves the issue for a random period of time (minutes - weeks)
*NEW* this isn't connected with memory overflow. We always have at least 2Gb free RAM when this happens.
*NEW* we have cache_instance = cache.get_cache("cache_entry") in static code. During investigation, I found that at the moment the bug happens cache_instance.get(key) returns wrong value, although get_cache("cache_entry").get(key) on the next line returns correct one. This means either bug disappears too quickly or for some reason cache_instance object got corrupted.
Isn't cache instance object returned by django's cache thread safe?
*NEW* we logged very strange case: as another wrong object from cache, we got model instance w/o id set. This means, the instance was never saved to DB therefore couldn't be cached. (I hope)
*NEW* At least one MemoryError was logged these days
I know, all of this sounds like some sort of magic.. And really, any ideas how that's possible or how to debug this would be very appreciated.
PS: My current assumption is that this is connected with multiprocessing: since the cache instance is created in static code, before the Worker process forks, all workers would end up sharing the same socket (does that sound plausible?)

Solved it finally:
Celery has a dynamic scaling feature: it can add and kill workers according to load.
It does this by forking an existing worker.
Open sockets and files are copied to the forked process, so both processes share them. This leads to a race condition in which one process reads a response intended for the other, and vice versa.
from django.core.cache import cache: this object stores a pre-connected memcached socket. Don't use it when your process could be dynamically forked, and don't use stored connections, pools and the like.
OR store them under the current PID, and check it each time you access the cache.
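The "store them under the current PID" idea might look roughly like the sketch below. PerProcessResource and its factory argument are hypothetical names, not part of Django or Celery, and the factory stands in for whatever call opens a real cache client or database connection. How the inherited copy should be discarded in the child depends on the client library, so this sketch simply stops using it.

```python
import os


class PerProcessResource:
    """Hand out a resource that is re-created whenever the PID changes."""

    def __init__(self, factory):
        self._factory = factory  # callable that opens a fresh connection
        self._pid = None
        self._resource = None

    def get(self):
        pid = os.getpid()
        # After a fork the stored PID no longer matches, so the child
        # builds its own resource instead of reusing the parent's socket.
        if self._resource is None or self._pid != pid:
            self._resource = self._factory()
            self._pid = pid
        return self._resource


if __name__ == "__main__":
    holder = PerProcessResource(lambda: {"opened_by": os.getpid()})
    print(holder.get())
```

Every access goes through get(), so the PID check happens on each use rather than once at import time.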

This has been bugging me for a while until I found this question and answer. I just want to add some things I've learnt.
You can easily reproduce this problem with a local memcached instance:
from django.core.cache import cache
import os
def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            # another process consumed our response, or we consumed theirs
            print("Unexpected response {} in process {}. Attempt {}/5".format(
                value, pid, x+1))
    os._exit(0)
# open the memcached connection before forking so every child inherits it
cache.set("access cache", "before fork")
for x in range(5):
    if os.fork() == 0:
        write_read_test()
What you can do is close the cache client as Django does in the request_finished signal:
https://github.com/django/django/blob/master/django/core/cache/init.py#L128
If you put a cache.close() after the fork, everything works as expected.
For celery you could connect to a signal that is fired after the worker is forked and execute cache.close().
This also affects gunicorn when preload is active and the cache is initialized before forking the workers.
For gunicorn, you could use post_fork in your gunicorn configuration:
def post_fork(server, worker):
    from django.core.cache import cache
    cache.close()
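Outside Celery and gunicorn, the standard library offers a comparable hook. This is a sketch under assumptions, not what either framework does internally: it uses os.register_at_fork (Python 3.7+), and the names get_client and _drop_client_in_child are invented for the example, with a plain dict standing in for a real cache client.

```python
import os

_client = None  # hypothetical module-level client that would hold a socket


def get_client():
    """Return the process-wide client, creating it lazily."""
    global _client
    if _client is None:
        _client = {"created_in": os.getpid()}  # stand-in for a connection
    return _client


def _drop_client_in_child():
    """Forget the inherited client so the child opens its own."""
    global _client
    _client = None


# Runs in every child created by os.fork(), before fork() returns there.
os.register_at_fork(after_in_child=_drop_client_in_child)


def child_sees_fresh_client():
    """Fork once and report whether the child built its own client."""
    parent_client = get_client()  # created before the fork, like a preload
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_end)
        fresh = get_client()["created_in"] == os.getpid()
        os.write(write_end, b"1" if fresh else b"0")
        os._exit(0)
    os.close(write_end)
    answer = os.read(read_end, 1)
    os.close(read_end)
    os.waitpid(pid, 0)
    return answer == b"1" and parent_client["created_in"] == os.getpid()
```

The parent keeps its original client while each child starts with none, which mirrors the effect of calling cache.close() in a post-fork callback.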

Segmentation fault error in a multi threaded app in python

I have a multi threaded app in python, wherein I create multiple producer threads and they extract the data from DB. Data is extracted in chunks. So the part where a thread creates sql statement with limit values is kept within lock. And to let threads execute queries simultaneously, query() function is kept outside the lock. Then the result fetching part is again kept under the lock. Below is the code snippet:
with UserAgent.lock:
    sqlGeoTarget = "call sp_ax_ari_select_user_agent_list('0'," + str(self.chunkStart) + "," + str(self.chunkSize) + ",1);"
    self.chunkStart += self.chunkSize
self.dbObj.query(sqlGeoTarget)
print("query executed. Processing data now..."+sqlGeoTarget)
with UserAgent.lock:
    result = self.dbObj.fetchAll()
    self.dbObj.dbCursor.close()
But this code fails with a fatal segmentation fault (core dumped), whereas if I put all of the code under the lock it executes fine. I explicitly close the cursor after fetching the data; it is reopened when the query() function is called again.
This code is inside a class named UserAgent, which is a shared resource for a class named Producer, so the database object is shared. I am 99% sure the problem is that, because the db object is shared, running queries simultaneously and then closing the cursor corrupts the result set. How can I solve this and still execute db queries concurrently?

Do not reuse connections across threads. Create a new connection for each thread instead.
From the MySQLdb User Guide:
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result(); with the latter you must ensure all the rows have been read before another query can be executed. It is further complicated by the addition of transactions, since transactions start when a cursor execute a query, but end when COMMIT or ROLLBACK is executed by the Connection object. Two threads simply cannot share a connection while a transaction is in progress, in addition to not being able to share it during query execution. This excessively complicated the code to the point where it just isn't worth it.
The general upshot of this is: Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
Emphasis mine.
Use thread local storage or a dedicated connection pooling library instead.
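A thread-local connection could be sketched as follows. To keep the example self-contained it uses the standard-library sqlite3 module as a stand-in for MySQLdb, which is an assumption of this sketch; get_connection and distinct_connections are hypothetical helper names.

```python
import sqlite3
import threading

_local = threading.local()  # each thread sees its own attributes here


def get_connection():
    """Return this thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # sqlite3 is a stand-in; with MySQLdb this would be a connect() call
        conn = sqlite3.connect(":memory:")
        _local.conn = conn
    return conn


def distinct_connections(n_threads=3):
    """Run n_threads workers and count the distinct connections they used."""
    seen = []
    seen_lock = threading.Lock()

    def worker():
        conn = get_connection()
        assert conn is get_connection()  # same thread, same connection
        conn.execute("SELECT 1").fetchone()
        with seen_lock:
            seen.append(conn)  # keep a reference so identities stay unique
        conn.close()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len({id(conn) for conn in seen})
```

Each worker gets its own connection, so queries can run concurrently without any thread touching another thread's cursor or result set.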

SSL error after python/django fork

I've got a python django app where part of it is parsing a large file. This takes forever, so I put a fork in to deal with the processing, allowing the user to continue to browse the site. Within the fork code, there's a bunch of calls to our postgres database, hosted on amazon.
I'm getting the following error:
SSL error: decryption failed or bad record mac
Here's the code:
pid = os.fork()
if pid == 0:
    lengthy_code_here(long)
    database_queries(my_database)
    os._exit(0)
None of my database calls are working, although they were working just fine before I inserted the fork. After looking around a little, it seems like it might be a stale database connection, but I'm not sure how to fix it. Does anyone have any ideas?

Forking while holding a socket open (such as a database connection) is generally not safe, as both processes will end up trying to use the same socket at once.
You will need, at a minimum, to close and reopen the database connection after forking.
Ideally, though, this is probably better suited for a task queueing system like Celery.

Django in production typically has a process dispatching to a bunch of processes that house django/python. These processes are long running, ie. they do NOT terminate after handling one request. Rather they handle a request, and then another, and then another, etc. What this means is changes that are not restored/cleaned up at the end of servicing a request will affect future requests.
When you fork a process, the child inherits various things from the parent, including all open descriptors (files, queues, directories). Even if you do nothing with the descriptors, there is still a problem, because when a process dies all its open descriptors are cleaned up.
So when you fork from a long running process you are setting yourself up to close all the open descriptors (such as the ssl connection) when the child process dies after it finishes processing. There are ways to prevent this from happening in a fork, but they can sometimes be difficult to get right.
A better design is to not fork, and instead hand off to another process that is either running, or started in a safer manner. For example:
at(1) can be used to queue up jobs for later (or immediate) execution
message queues can be used to pass messages to other daemons
standard IPC constructs such as pipes can be used to communicate to other daemons
update:
If you want to use at(1) you will have to create a standalone script. You can use a serializer to pass the data from django to the script.
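The hand-off to a standalone script with serialized data could be sketched like this, using subprocess rather than at(1) so that it runs anywhere. WORKER_SOURCE and run_job_in_fresh_process are hypothetical names; in a real project the worker would be a separate script file, and the job dictionary is invented for illustration. On Python 3, subprocess closes inherited descriptors other than stdin, stdout and stderr by default, so the new process does not share the parent's database socket.

```python
import json
import subprocess
import sys

# Stand-in for a standalone worker script: it reads a JSON job from stdin
# and writes a JSON result to stdout.
WORKER_SOURCE = """
import json, sys
job = json.load(sys.stdin)
print(json.dumps({"rows": len(job["lines"])}))
"""


def run_job_in_fresh_process(job):
    """Serialize the job, run it in a new interpreter, return its result."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(job),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(run_job_in_fresh_process({"lines": ["a", "b", "c"]}))
```

subprocess.run waits for the worker to finish; to let the user keep browsing, subprocess.Popen could start the same command without waiting.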

Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?

Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
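Disabling the check only removes the guard; serializing access is then the caller's job. Below is a minimal sketch, assuming a single shared connection protected by a threading.Lock. The table name results and the function insert_from_threads are made up for the example.

```python
import sqlite3
import threading


def insert_from_threads(n_threads=4, rows_per_thread=50):
    """Insert rows from several threads through one shared connection."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE results (thread_id INTEGER, value INTEGER)")
    lock = threading.Lock()

    def worker(thread_id):
        for value in range(rows_per_thread):
            # Only one thread at a time may touch the shared connection.
            with lock:
                conn.execute(
                    "INSERT INTO results VALUES (?, ?)", (thread_id, value)
                )
                conn.commit()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return count
```

Because every statement runs under the lock, the threads take turns on the database, so this helps with correctness rather than throughput.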

You can use consumer-producer pattern. For example you can create queue that is shared between threads. First thread that fetches data from the web enqueues this data in the shared queue. Another thread that owns database connection dequeues data from the queue and passes it to the database.
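The pattern might look like the sketch below, where one writer thread owns the only connection and the fetching threads just enqueue strings. The names writer, run_demo and _STOP, the results table, and the payload strings are all illustrative; the database lives in a temporary directory so the example cleans up after itself.

```python
import os
import queue
import sqlite3
import tempfile
import threading

_STOP = object()  # sentinel telling the writer to finish


def writer(db_path, work_queue):
    """Own the only connection and write whatever arrives on the queue."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")
    while True:
        item = work_queue.get()
        if item is _STOP:
            break
        conn.execute("INSERT INTO results VALUES (?)", (item,))
        conn.commit()
    conn.close()


def run_demo(n_fetchers=3):
    """Start fetcher threads that enqueue data; return the stored row count."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "mydatabase.db")
        work_queue = queue.Queue()
        writer_thread = threading.Thread(
            target=writer, args=(db_path, work_queue)
        )
        writer_thread.start()
        # Each fetcher stands in for a thread that downloaded something.
        fetchers = [
            threading.Thread(target=work_queue.put, args=("payload-%d" % i,))
            for i in range(n_fetchers)
        ]
        for t in fetchers:
            t.start()
        for t in fetchers:
            t.join()
        work_queue.put(_STOP)
        writer_thread.join()
        conn = sqlite3.connect(db_path)
        count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
        conn.close()
        return count
```

Since only the writer thread ever opens the database for writing, neither the "database is locked" error nor the same-thread check is triggered.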

The following found on mail.python.org.pipermail.1239789
I have found the solution. I don't know why python documentation has not a single word about this option. So we have to add a new keyword argument to connection function
and we will be able to create cursors out of it in different thread. So use:
sqlite.connect(":memory:", check_same_thread = False)
works out perfectly for me. Of course from now on I need to take care
of safe multithreading access to the db. Anyway thx all for trying to help.

Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.

You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.

Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.

You need to call session.close() after every database transaction, so that a cursor is only ever used in the thread that created it. Using the same cursor from multiple threads is what causes this error.

Use threading.Lock()

I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
You can find the code and the complete solution for the benchmarks in my SO answer HERE. Hope that helps!

Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)

I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.

I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.

You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).

Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread set to False. In your case, that means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
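The same check can be made from Python. This is a small sketch, assuming the linked SQLite build exposes its compile options through PRAGMA compile_options; the helper name sqlite_threadsafe_option is made up, and it returns None if no THREADSAFE entry is reported.

```python
import sqlite3


def sqlite_threadsafe_option():
    """Return the THREADSAFE compile option as a string, or None."""
    conn = sqlite3.connect(":memory:")
    try:
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    for option in options:
        if option.startswith("THREADSAFE="):
            return option.split("=", 1)[1]
    return None


if __name__ == "__main__":
    print("THREADSAFE compile option:", sqlite_threadsafe_option())
    # DB-API thread-safety level reported by the sqlite3 module itself
    print("sqlite3.threadsafety:", sqlite3.threadsafety)
```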

The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time-out after a time (default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details on that).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3
class InsertionThread(threading.Thread):
    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number
    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()
        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()
# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!
