SSL syscall error bad file descriptor using sqlalchemy and postgres - python

So I have a daemon process that talks to Postgres via sqlalchemy. The daemon does something like this:
while True:
    oEngine = setup_new_engine()
    with oEngine.connect() as conn:
        Logger.debug("connection established")
        DBSession = sessionmaker(bind=conn)()
        Logger.debug('DBSession created. id={0}'.format(id(DBSession)))
        #do a bunch of stuff with DBSession
        DBSession.commit()
        Logger.debug('DBSession committed. id={0}'.format(id(DBSession)))
On the first iteration of the forever loop everything works great. For a while. The DBSession successfully makes a few queries to the database. But then one query fails with the error:
OperationalError: (OperationalError) SSL SYSCALL error: Bad file descriptor
This speaks to me of a closed connection or file descriptor being used. But the connections are created and maintained by the daemon so I don't know what this means.
In other words what happens is:
create engine
open connection
setup dbsession
query dbsession => works great
query dbsession => ERROR
The query in question looks like:
(DBSession.query(Login)
    .filter(Login.LFTime == oLineTime)
    .filter(Login.success == self.success)
    .count())
which seems perfectly reasonable to me.
My question is: What kinds of reasons could there be for this kind of behaviour and how can I fix it or isolate the problem?
Let me know if you need more code. There is a heck of a lot of it so I went for the minimalist approach here...

I fixed this by thinking about the session scope instead of the transaction scope.
def do_stuff():
    oEngine = setup_new_engine()
    with oEngine.connect() as conn:
        Logger.debug("connection established")
        DBSession = sessionmaker(bind=conn)()
        #do a bunch of stuff with DBSession
        DBSession.commit()
        DBSession.close()
while True:
    do_stuff()
I would still like to know why this fixed things though...

You are creating the session inside your while loop, which is very ill-advised. With the code the way you had it the first time, you would spawn off a new connection at every iteration and leave it open. Before too long, you would be bound to hit some kind of limit and be unable to open yet another new session. (What kind of limit? Hard to say, but it could be a memory condition since DB connections are pretty weighty; it could be a DB-server limit where it will only accept a certain number of simultaneous user connections for performance reasons; hard to know and it doesn't really matter, because whatever the limit was, it has prevented you from using a very wasteful approach and hence has worked as intended!)
The solution you have hit upon fixes the problem because, just as you open a new connection with each loop, you now also close it with each loop, freeing up the resources and allowing later iterations to create sessions of their own and succeed. However, this is still a lot of unnecessary work and a waste of processing resources on both the server and the client. I suspect it would work just as well, and potentially be a lot faster, if you moved the sessionmaker outside the while loop.
def main():
    oEngine = setup_new_engine()
    with oEngine.connect() as conn:
        Logger.debug("connection established")
        DBSession = sessionmaker(bind=conn)()
        apparently_infinite_loop(DBSession)
        # close only after we are done and have somehow exited the infinite loop
        DBSession.close()
def apparently_infinite_loop(DBSession):
    while True:
        #do a bunch of stuff with DBSession
        DBSession.commit()
I don't currently have a working sqlalchemy setup, so you likely have some syntax errors in there, but anyway I hope it makes the point about the fundamental underlying issue.
More detail is available here: http://docs.sqlalchemy.org/en/rel_0_9/orm/session.html#session-faq-whentocreate
Some points from the docs to note:
"The Session will begin a new transaction if it is used again". So this is why you don't need to be constantly opening new sessions in order to get transaction scope; a commit is all it takes.
"As a general rule, the application should manage the lifecycle of the session externally to functions that deal with specific data." So your fundamental problem originally (and still) is all of that session management going on right down there inside the while loop right alongside your data processing code.

Related

How can I find where a rollback is invoked?

I use Amazon RDS, Lambda, Python and SQLAlchemy. When I checked Amazon RDS Performance Insights, I found that rollbacks are being invoked.
But when I execute the other queries shown in Performance Insights, there are no errors.
How can I find where the rollback is invoked, or why it is invoked?
I suspected a bad query, so I re-sent the same query that I found in Performance Insights, but there was no rollback.
I suspected a traffic issue, so I sent the same query many times (about 1,000,000) using a 'for' loop from 5 terminals at the same time and then checked SHOW PROCESSLIST, but there was no rollback.
I heard that sqlalchemy.create_engine uses a connection pool and that SQLAlchemy invokes a rollback when a connection is closed. But I don't know how to check whether this is happening, or whether it explains my problem.
This is my RDS Performance Insights view:
Rollbacks can originate from either rolling back a transaction to unwind queries, or upon returning a connection to the pool.
One way that you could get a feel for what your app is doing would be to hook into those rollback actions through the event system to enable some tracking.
There are two events that you'd need to look at:
ConnectionEvents.rollback:
Intercept rollback() events, as initiated by a Transaction.
PoolEvents.reset:
Called before the “reset” action occurs for a pooled connection.
You could set listeners on these events that increment some counters, or perform some logging that is specific to counting the number of rollbacks. Then you'd be able to get a feel for the relative weight of transaction rollbacks vs pool rollbacks.
E.g. using some crude global counters but you can add whatever logic that you need:
import logging
from sqlalchemy import event
POOL_ROLLBACKS = 0
TXN_ROLLBACKS = 0
@event.listens_for(YourEngine, 'reset')
def receive_reset(dbapi_connection, connection_record):
    # track a rollback issued when a connection is returned to the pool
    global POOL_ROLLBACKS
    POOL_ROLLBACKS += 1
    logging.debug(f"Pool rollback count: {POOL_ROLLBACKS}")
@event.listens_for(YourEngine, 'rollback')
def receive_rollback(conn):
    # track a transaction based rollback
    global TXN_ROLLBACKS
    TXN_ROLLBACKS += 1
    logging.debug(f'Transaction rollback count {TXN_ROLLBACKS}')

Pandas leaving idle Postgres connections open after to_sql?

I am doing a lot of ETL with Pandas and Postgres. I have a ton of idle connections, many marked with COMMIT and ROLLBACK, that I am not sure how to prevent from sitting as idle for long periods rather than closing. The main code I use to write to the database is using pandas to_sql:
def write_data_frame(self, data_frame, table_name):
    engine = create_engine(self.engine_string)
    data_frame.to_sql(name=table_name, con=engine, if_exists='append', index=False)
I know this is definitely not best practice for PostgreSQL and I should be doing something like passing params to a Stored Procedure or Function or something, but this is how we are set up to get data_frames from non-Postgres databases / data sources and upload to Postgres.
My pgAdmin looks like this:
Can someone please point me in the right direction of how to avoid this many idle connections in the future? Some of our database connections are meant to be long-lived as they are continuous "batch" processes. But it seems like some one-off events are leaving connections open and idle.
Using the engine as a one-off is probably not ideal for you. If possible, you could make the engine a member of the class and call it as self.engine.
Another option would be to explicitly dispose of the engine.
def write_data_frame(self, data_frame, table_name):
    engine = create_engine(self.engine_string)
    data_frame.to_sql(name=table_name, con=engine, if_exists='append', index=False)
    engine.dispose()
As noted in the docs,
This has the effect of fully closing all currently checked in database connections. Connections that are still checked out will not be closed, however they will no longer be associated with this Engine, so when they are closed individually, eventually the Pool which they are associated with will be garbage collected and they will be closed out fully, if not already closed on checkin.
This may also be a good use case for a try...except...finally block since .dispose will only be called when the preceding code executes without error.
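A minimal sketch of that try...finally idea, assuming the same to_sql call as in the snippets above. The `engine_factory` parameter is a hypothetical addition (it is not in the original code) that only exists so the function can be exercised without a real database; when it is omitted, the sketch falls back to SQLAlchemy's `create_engine`.

```python
def write_data_frame(data_frame, table_name, engine_string, engine_factory=None):
    """Append data_frame to table_name and always dispose of the engine."""
    if engine_factory is None:
        # Imported lazily so the sketch can be read and tried without SQLAlchemy.
        from sqlalchemy import create_engine
        engine_factory = create_engine

    engine = engine_factory(engine_string)
    try:
        data_frame.to_sql(name=table_name, con=engine,
                          if_exists="append", index=False)
    finally:
        # Runs whether to_sql succeeded or raised, so the engine's pooled
        # connections are closed instead of being left idle.
        engine.dispose()
```

Any exception from to_sql still propagates to the caller; the finally clause only guarantees that dispose() runs first.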
I would much rather be suggesting to you that you pass connections like so:
with engine.connect() as connection:
    data_frame.to_sql(..., con=connection)
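But the to_sql docs at the time indicated that you can't do that and that only an engine is accepted. (More recent pandas documentation lists con as accepting a SQLAlchemy Engine or Connection.)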

Segmentation fault error in a multi threaded app in python

I have a multi threaded app in python, wherein I create multiple producer threads and they extract the data from DB. Data is extracted in chunks. So the part where a thread creates sql statement with limit values is kept within lock. And to let threads execute queries simultaneously, query() function is kept outside the lock. Then the result fetching part is again kept under the lock. Below is the code snippet:
with UserAgent.lock:
    sqlGeoTarget = "call sp_ax_ari_select_user_agent_list('0'," + str(self.chunkStart) + "," + str(self.chunkSize) + ",1);"
    self.chunkStart += self.chunkSize
self.dbObj.query(sqlGeoTarget)
print "query executed. Processing data now..."+sqlGeoTarget
with UserAgent.lock:
    result = self.dbObj.fetchAll()
    self.dbObj.dbCursor.close()
But this code fails with a fatal segmentation fault (core dumped), whereas if I put all of the code under the lock it executes fine. I explicitly close the cursor after fetching the data; it is reopened when the query() function is called again.
This code is inside a class named UserAgent, which is a shared resource for a class named Producer, so the database object is shared. The problem is therefore almost certainly that, because the db object is shared, running queries simultaneously and then closing the cursor corrupts the result set. How can I solve this and still execute DB queries concurrently?
Do not reuse connections across threads. Create a new connection for each thread instead.
From the MySQLdb User Guide:
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result(); with the latter you must ensure all the rows have been read before another query can be executed. It is further complicated by the addition of transactions, since transactions start when a cursor execute a query, but end when COMMIT or ROLLBACK is executed by the Connection object. Two threads simply cannot share a connection while a transaction is in progress, in addition to not being able to share it during query execution. This excessively complicated the code to the point where it just isn't worth it.
The general upshot of this is: Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
Emphasis mine.
Use thread local storage or a dedicated connection pooling library instead.
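The thread-local approach can be sketched as follows. This is only an illustration: it uses the standard library's sqlite3 as a stand-in for MySQLdb so that it runs anywhere, and the names `get_connection`, `worker` and `run_demo` are made up for the example. The point is that each thread lazily creates, uses and closes its own connection and never touches another thread's.

```python
import sqlite3
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_connection(db_path):
    """Return the calling thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn
    return conn


def worker(db_path, seen, index):
    conn = get_connection(db_path)
    # A second call from the same thread returns the same object.
    reused = get_connection(db_path) is conn
    conn.execute("SELECT 1").fetchone()
    seen[index] = (conn, reused)   # keep a reference so the caller can compare
    conn.close()                   # close in the thread that owns the connection
    del _local.conn


def run_demo(n_threads=3):
    seen = [None] * n_threads
    threads = [threading.Thread(target=worker, args=(":memory:", seen, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With a real MySQL driver the only change should be the connect call inside get_connection; the queries themselves can then run concurrently without any lock, because no connection or cursor is shared.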

Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
You can use consumer-producer pattern. For example you can create queue that is shared between threads. First thread that fetches data from the web enqueues this data in the shared queue. Another thread that owns database connection dequeues data from the queue and passes it to the database.
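A minimal sketch of that producer-consumer layout, under some assumptions: the fetch is replaced by a fake payload, and the table name `results` and the helper names are invented for the example. Only the writer thread ever opens a sqlite3 connection, so neither the "database is locked" error nor the same-thread check comes into play.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel that tells the writer thread to finish


def writer(db_path, q):
    """Own the only connection: drain the queue and store each item."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (source TEXT, payload TEXT)")
    while True:
        item = q.get()
        if item is _STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    conn.close()


def fetcher(name, q):
    """Stand-in for the web fetch: enqueue a fake payload instead."""
    q.put((name, "data from " + name))


def run_demo(db_path, n_fetchers=5):
    q = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(db_path, q))
    writer_thread.start()
    fetchers = [threading.Thread(target=fetcher, args=("thread-%d" % i, q))
                for i in range(n_fetchers)]
    for t in fetchers:
        t.start()
    for t in fetchers:
        t.join()
    q.put(_STOP)          # all producers are done, let the writer exit
    writer_thread.join()
```

queue.Queue does its own locking, so the fetcher threads need no extra synchronisation.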
The following found on mail.python.org.pipermail.1239789
I have found the solution. I don't know why python documentation has not a single word about this option. So we have to add a new keyword argument to connection function
and we will be able to create cursors out of it in different thread. So use:
sqlite.connect(":memory:", check_same_thread = False)
works out perfectly for me. Of course from now on I need to take care
of safe multithreading access to the db. Anyway thx all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.
You need to call session.close() after every database transaction so that each cursor is only ever used within its own thread; sharing the same cursor across multiple threads is what causes this error.
Use threading.Lock()
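One way to read that suggestion, sketched under assumptions: a single connection is opened with check_same_thread=False and every use of it is wrapped in one shared threading.Lock, so the threads take turns. The table name `items` and the function names are made up for the example.

```python
import sqlite3
import threading


def run_demo(n_threads=4, rows_per_thread=50):
    # One shared connection; the same-thread check is disabled, so the
    # lock below is what keeps the threads from using it at the same time.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    lock = threading.Lock()
    conn.execute("CREATE TABLE items (thread_id INTEGER, n INTEGER)")

    def insert_rows(thread_id):
        for n in range(rows_per_thread):
            with lock:  # serialize every access to the shared connection
                conn.execute("INSERT INTO items VALUES (?, ?)", (thread_id, n))
                conn.commit()

    threads = [threading.Thread(target=insert_rows, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with lock:
        count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return count
```

The lock makes database access effectively sequential, which fits the benchmark answer's observation that sequential writes work best with SQLite.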
I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
You can find the code and complete solution for the benchmarks in my SO answer HERE. Hope that helps!
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you gotta do to avoid the threading exception is to create the Python connection with check_same_thread equal to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
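The same check can be done from Python. This is a small sketch, assuming the compile options are reported as strings of the form THREADSAFE=n; the function name is made up for the example.

```python
import sqlite3


def sqlite_threadsafe_option():
    """Return the THREADSAFE compile option as an int, or None if not reported."""
    conn = sqlite3.connect(":memory:")
    try:
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    for option in options:
        if option.startswith("THREADSAFE="):
            return int(option.split("=", 1)[1])
    return None


if __name__ == "__main__":
    print("THREADSAFE =", sqlite_threadsafe_option())
```

Even when the library is compiled threadsafe, passing check_same_thread=False only removes Python's own check; coordinating writes between threads is still up to the program.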
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time-out after a time (default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details on that).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3
class InsertionThread(threading.Thread):
    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number
    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()
        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()
# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!

Mysql Connection, one or many?

I'm writing a script in python which basically queries WMI and updates the information in a mysql database. One of those "write something you need" to learn to program exercises.
In case something breaks in the middle of the script, for example, the remote computer turns off, it's separated out into functions.
Query Some WMI data
Update that to the database
Query Other WMI data
Update that to the database
Is it better to open one mysql connection at the beginning and leave it open or close the connection after each update?
It seems as though one connection would use less resources. (Although I'm just learning, so this is a complete guess.) However, opening and closing the connection with each update seems more 'neat'. Functions would be more stand alone, rather than depend on code outside that function.
"However, opening and closing the connection with each update seems more 'neat'. "
It's also a huge amount of overhead -- and there's no actual benefit.
Creating and disposing of connections is relatively expensive. More importantly, what's the actual reason? How does it improve, simplify, clarify?
Generally, most applications have one connection that they use from when they start to when they stop.
I don't think that there is a "better" solution. It's too early to think about resources. And since WMI is quite slow (in comparison to a SQL connection), the db is not an issue.
Just make it work. And then make it better.
The good thing about working with an open connection here is that the "natural" solution is to use objects and not just functions. So it will be a learning experience (in case you are learning Python and not MySQL).
Think for a moment about the following scenario:
for dataItem in dataSet:
    update(dataItem)
If you open and close your connection inside of the update function and your dataSet contains a thousand items then you will destroy the performance of your application and ruin any transactional capabilities.
A better way would be to open a connection and pass it to the update function. You could even have your update function call a connection manager of sorts. If you intend to perform single updates periodically then open and close your connection around your update function calls.
In this way you will be able to use functions to encapsulate your data operations and be able to share a connection between them.
However, this approach is not great for performing bulk inserts or updates.
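The idea of opening the connection once and passing it into the update function can be sketched like this. It is an illustration only: sqlite3 stands in for MySQL so the example runs anywhere, and the `items` table and the function names are invented.

```python
import sqlite3


def update(conn, data_item):
    """Write one item using a connection that the caller owns."""
    conn.execute("INSERT INTO items (name, value) VALUES (?, ?)", data_item)


def update_all(conn, data_set):
    # One connection and one transaction for the whole batch: the
    # connection context manager commits on success and rolls back
    # if an exception escapes the block.
    with conn:
        for data_item in data_set:
            update(conn, data_item)


def run_demo():
    conn = sqlite3.connect(":memory:")   # opened once, by the caller
    conn.execute("CREATE TABLE items (name TEXT, value INTEGER)")
    update_all(conn, [("a", 1), ("b", 2), ("c", 3)])
    count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return count
```

Because update() never opens or closes anything itself, it stays a small stand-alone function while the caller decides how long the connection lives and where the transaction boundaries are.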
Useful clues in S.Lott's and Igal Serban's answers. I think you should first find out your actual requirements and code accordingly.
Just to mention a different strategy: some applications keep a pool of database (or whatever) connections and, whenever a transaction is needed, just pull one from that pool. It seems rather obvious that you only need one connection for this kind of application. But you can still keep a pool of one connection and apply the following:
Whenever a database transaction is needed, the connection is pulled from the pool and returned at the end.
(optional) The connection is expired (and replaced by a new one) after a certain amount of time.
(optional) The connection is expired after a certain amount of usage.
(optional) The pool can check (by sending an inexpensive query) if the connection is alive before handing it over to the program.
This is somewhat in between single connection and connection per transaction strategies.
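The pool-of-one strategy described above could look roughly like this. It is a sketch under stated assumptions, not a production pool: the class name, its parameters and their defaults are invented for the example, the liveness check is a plain SELECT 1, and no locking is done, so it is meant for a single-threaded script like the one in the question.

```python
import sqlite3
import time
from contextlib import contextmanager


class SingleConnectionPool:
    """Holds at most one connection and replaces it when old, overused or dead."""

    def __init__(self, connect, max_age_seconds=300.0, max_uses=1000):
        self._connect = connect            # zero-argument callable returning a DB-API connection
        self._max_age = max_age_seconds
        self._max_uses = max_uses
        self._conn = None
        self._created_at = 0.0
        self._uses = 0

    def _is_alive(self):
        # Inexpensive query to verify that the connection still works.
        try:
            cursor = self._conn.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchone()
            cursor.close()
            return True
        except Exception:
            return False

    def _expired(self):
        too_old = time.monotonic() - self._created_at > self._max_age
        too_used = self._uses >= self._max_uses
        return too_old or too_used or not self._is_alive()

    def _replace(self):
        if self._conn is not None:
            try:
                self._conn.close()
            except Exception:
                pass  # the old connection may already be unusable
        self._conn = self._connect()
        self._created_at = time.monotonic()
        self._uses = 0

    @contextmanager
    def connection(self):
        """Hand out the pooled connection for the duration of one transaction."""
        if self._conn is None or self._expired():
            self._replace()
        self._uses += 1
        yield self._conn


if __name__ == "__main__":
    pool = SingleConnectionPool(lambda: sqlite3.connect(":memory:"), max_uses=2)
    with pool.connection() as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.commit()
```

For a MySQL script the connect callable would wrap the driver's own connect call; the rest of the class only relies on cursor(), execute() and close().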
