I've noticed this behavior in SQLite. When I reuse the cursor object, the working-set memory shown in Task Manager keeps increasing until my program throws an out-of-memory exception.
I refactored the code so that each time I query, I open a connection to the SQLite file, run the query, and then close the connection.
The latter approach somehow seems to be much less memory-hungry: usage doesn't grow beyond a certain point.
All I do with my SQLite DB is a simple SELECT (which contains two aggregations) against a single table.
Is this a behavior we can somehow control? I'd like to reuse my cursor object, but I don't want memory to keep being eaten up...
See SQLite: PRAGMA cache_size
By default the cache size is fairly big (~2MB) and that is per-connection. You can set it smaller with the SQL statement:
PRAGMA cache_size=-KB
Use a negative value to set the size in kilobytes; otherwise the value sets the number of pages to use.
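For example, a minimal sketch of applying this from Python's sqlite3 module (the file name mydata.db is just a placeholder):
import sqlite3

# Shrink the per-connection page cache to roughly 64 KiB instead of the ~2 MB default.
conn = sqlite3.connect("mydata.db")
conn.execute("PRAGMA cache_size = -64")              # negative value = size in KiB
print(conn.execute("PRAGMA cache_size").fetchone())  # confirm the new setting
conn.close()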
Also, if using multiple connections you might want to employ a shared cache to save memory:
SQLITE: Shared Cache
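A hedged sketch of what that might look like from Python, using a URI connection string so two connections share one page cache (the file name is a placeholder):
import sqlite3

# Both connections share a single page cache instead of keeping two separate ones.
conn_a = sqlite3.connect("file:mydata.db?cache=shared", uri=True)
conn_b = sqlite3.connect("file:mydata.db?cache=shared", uri=True)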
Yes, the sqlite3 module for Python uses a statement cache.
You can read about the cached_statements parameter here.
More info on this issue.
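For illustration, a hedged sketch of tuning that parameter (the database path, table and column names are hypothetical):
import sqlite3

# Keep only 8 prepared statements per connection instead of the default 100.
conn = sqlite3.connect("mydata.db", cached_statements=8)
cur = conn.cursor()
# Re-running the same SQL text reuses the cached prepared statement,
# so only the bound parameter changes between calls.
for i in range(1000):
    cur.execute("SELECT COUNT(*), SUM(value) FROM t WHERE bucket = ?", (i,))
conn.close()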
I wish to run some benchmarks on different databases that I have. I repeat every query n times so that I can report average query times. I am aware that SQLite caches statements, as per the documentation:
The sqlite3 module internally uses a statement cache to avoid SQL parsing overhead. If you want to explicitly set the number of statements that are cached for the connection, you can set the cached_statements parameter. The currently implemented default is to cache 100 statements.
However, it is unclear to me whether this cache persists. In short: does the SQLite cache persist (1) within a Python session, even after closing the connection, and (2) across Python sessions (i.e., is the cache written to disk)?
My code looks something like this:
times = []
for i in range(n_repeat):
    start = time.perf_counter()
    conn = sqlite3.connect(dbpath)
    # do query
    conn.commit()
    conn.close()
    times.append(time.perf_counter() - start)
return timedelta(seconds=sum(times) / n_repeat)
My assumption was that whenever I close the connection, any and all caching would be discarded and garbage-collected immediately. I find little variance in the n runs (no difference between 1st and nth iteration), so I would think that my assumption is correct. But I'd rather be sure so I am asking here.
tl;dr: does SQLite cache queries even after a connection has closed? And does it cache queries across Python sessions?
After some digging I found that, apart from caching within Python/SQLite (as Shawn mentions in the comments), a hugely impactful factor was the OS-wide caching that occurs. I am not sure what exactly is cached (the database, the indices, the query itself), but if I delete the system caches after every iteration (so no caching should occur), then I find that consecutive calls are much slower (on the order of 100x or more!).
So yes, SQLite caches statements through Python, but these are released after the database is closed. In addition, system caches play a huge role.
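For anyone reproducing this, a hedged, Linux-only sketch of one way to drop the OS caches between iterations (it requires root, and the /proc interface is Linux-specific):
import subprocess

def drop_os_caches():
    # Flush dirty pages to disk, then ask the kernel to drop the page cache,
    # dentries and inodes so the next benchmark run starts "cold".
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)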
In the case of SQLite, it is not clear whether we can easily commit right after each DataFrame insert. (Assuming that auto-commit is off by default, following the Python database wrapping convention.)
Using the simplest SQLAlchemy API flow ―
db_engine = db.create_engine()
for .....:
    # slowly compute some_df, takes a lot of time
    some_df.to_sql(con=db_engine)
How can we make sure that every .to_sql is committed?
For motivation, imagine the particular use case being that each write reflects the result of a potentially very long computation and we do not want to lose a huge batch of such computations nor any single one of them, in case a machine goes down or in case the python sqlalchemy engine object is garbage collected before all its writes have actually drained in the database.
I believe auto-commit is off by default, and for sqlite, there is no way of changing that in the create_engine command. What might be the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write ― when using the simplistic .to_sql api?
Or must the code be refactored to use a different api flow to accomplish that?
You can set the connection to autocommit by:
db_engine = db_engine.execution_options(autocommit=True)
From https://docs.sqlalchemy.org/en/13/core/connections.html#understanding-autocommit:
The “autocommit” feature is only in effect when no Transaction has otherwise been declared. This means the feature is not generally used with the ORM, as the Session object by default always maintains an ongoing Transaction.
In your code you have not presented any explicit transactions, and so the engine used as the con is in autocommit mode (as implemented by SQLA).
Note that SQLAlchemy implements its own autocommit that is independent from the DB-API driver's possible autocommit / non-transactional features.
Hence "the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write" is what you already had, unless to_sql() emits some funky statements that SQLA does not recognize as data-changing operations, which it does not, at least in recent versions.
It might be that the SQLA autocommit feature is on the way out in the next major release, but we'll have to wait and see.
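If relying on SQLA-level autocommit feels risky given that pending removal, a hedged alternative sketch is to wrap each write in an explicit transaction via engine.begin(); the URL, table name, work_items and compute() below are placeholders, not anything from the question:
import sqlalchemy as db

db_engine = db.create_engine("sqlite:///results.db")   # placeholder URL

for item in work_items:                  # placeholder for the real loop
    some_df = compute(item)              # the slow computation (placeholder)
    # engine.begin() opens a transaction and commits it on successful exit,
    # so each DataFrame is durably written before the next iteration starts.
    with db_engine.begin() as conn:
        some_df.to_sql("results", con=conn, if_exists="append", index=False)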
I'm using the Python SQLite driver and trying to stream-read from Table1, putting each row through a Python function that computes something and creates another object, which I then write to another table, Table2.
But I'm running into locking issues. Shouldn't SQLite be able to do this easily? Is there some special mode I need to turn on?
I can get this to work if I read the whole stream into memory first and then write the other table by looping over the list, but that isn't streaming and breaks down if Table1 can't fit into memory. Isn't there a way to permit this basic kind of streaming operation?
Update
I tried the following, and perhaps it's the answer:
db = sqlite3.connect(file, timeout=30.0, check_same_thread=False,
                     isolation_level=None)
db.execute('pragma journal_mode=wal;')
That is, I added isolation_level=None and the pragma command. This puts the database into WAL (write-ahead log) mode. It seems to avoid the locking issue, for my use case anyway.
SQLite does not lock individual tables but the whole database. So if I understand you correctly, it will not be possible with SQLite.
For reference take a look at sqlite locks.
It mentions:
An EXCLUSIVE lock is needed in order to write to the database file. Only one EXCLUSIVE lock is allowed on the file and no other locks of any kind are allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency, SQLite works to minimize the amount of time that EXCLUSIVE locks are held.
A possible workaround might be to create one transaction for all your reading and writing operations. SQLite will only lock the database when the transaction is actually committed. See here for transactions.
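A hedged sketch of that workaround, with hypothetical table/column names and a placeholder transform() function; both the read and the write go through the same connection inside one explicit transaction:
import sqlite3

db = sqlite3.connect("data.db", isolation_level=None)  # no implicit BEGINs; we manage the transaction
try:
    db.execute("BEGIN")
    read_cur = db.cursor()
    write_cur = db.cursor()
    for row_id, payload in read_cur.execute("SELECT id, payload FROM Table1"):
        result = transform(payload)      # placeholder for the per-row computation
        write_cur.execute("INSERT INTO Table2 (id, result) VALUES (?, ?)",
                          (row_id, result))
    db.execute("COMMIT")                 # one commit for the whole batch
except Exception:
    db.execute("ROLLBACK")
    raise
finally:
    db.close()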
My colleague runs a script that pulls data from the DB periodically. He is using the query:
'SELECT url, data FROM table LIMIT {} OFFSET {}'.format(OFFSET, PAGE * OFFSET)
We use Amazon Aurora, and he has his own slave (replica) server, but every time the script runs it touches 98%+.
The table has millions of records.
Would it be better if we went for a SQL dump instead of SQL queries for fetching the data?
The options that come to my mind are:
SQL dump of selected tables (not sure of the benchmark)
Federated tables based on a certain reference (date, ID, etc.)
Thanks
I'm making some fairly big assumptions here, but from
without choking it
I'm guessing you mean that when your colleague runs the SELECT to grab the large amount of data, the database performance drops for all other operations - presumably your primary application - while the data is being prepared for export.
You mentioned SQL Dump so I'm also assuming that this colleague will be satisfied with data that is roughly correct, ie: it doesn't have to be up to the instant transactionally correct data. Just good enough for something like analytics work.
If those assumptions are close, your colleague and your database might benefit from
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
This line of code should be used carefully and almost never in a line of business application but it can help people querying the live database with big queries, as long as you fully understand the implications.
To use it, simply start a transaction and put this line before any queries you run.
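For example, a hedged sketch using PyMySQL (any DB-API MySQL driver should work similarly; the host, credentials, page_size and page are placeholders, and the backticks are only there because table is a reserved word):
import pymysql   # assumption: any DB-API MySQL driver works much the same way

conn = pymysql.connect(host="replica-host", user="report",
                       password="secret", database="mydb")   # placeholders
try:
    with conn.cursor() as cur:
        # Applies to the next transaction opened on this session.
        cur.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
        cur.execute("START TRANSACTION")
        cur.execute("SELECT url, data FROM `table` LIMIT %s OFFSET %s",
                    (page_size, page * page_size))
        rows = cur.fetchall()
        cur.execute("COMMIT")
finally:
    conn.close()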
The 'choking'
What you are seeing when your colleague runs a large query is record locking. Your database engine is - quite correctly - set up to provide an accurate view of your data, at any point. So, when a large query comes along the database engine first waits for all write locks (transactions) to clear, runs the large query and holds all future write locks until the query has run.
This actually happens for all transactions, but you only really notice it for the big ones.
What READ UNCOMMITTED does
By setting the transaction isolation level to READ UNCOMMITTED, you are telling the database engine that this transaction doesn't care about write locks and to go ahead and read anyway.
This is known as a 'dirty read', in that the long-running query could well read a table with a write lock on it and will ignore the lock. The data actually read could be the data before the write transaction has completed, or a different transaction could start and modify records before this query gets to it.
The data returned from anything with READ UNCOMMITTED is not guaranteed to be correct in the ACID sense of a database engine, but for some use cases it is good enough.
What the effect is
Your large queries magically run faster and don't lock the database while they are running.
Use with caution and understand what it does before you use it though.
MySQL Manual on transaction isolation levels
I'm writing a script in Python which basically queries WMI and updates the information in a MySQL database. One of those "write something you need" exercises for learning to program.
In case something breaks in the middle of the script (for example, the remote computer turns off), it's separated out into functions:
Query Some WMI data
Update that to the database
Query Other WMI data
Update that to the database
Is it better to open one mysql connection at the beginning and leave it open or close the connection after each update?
It seems as though one connection would use fewer resources. (Although I'm just learning, so this is a complete guess.) However, opening and closing the connection with each update seems 'neater': functions would be more standalone, rather than depending on code outside the function.
"However, opening and closing the connection with each update seems more 'neat'. "
It's also a huge amount of overhead -- and there's no actual benefit.
Creating and disposing of connections is relatively expensive. More importantly, what's the actual reason? How does it improve, simplify, clarify?
Generally, most applications have one connection that they use from when they start to when they stop.
I don't think that there is a "better" solution. It's too early to think about resources. And since WMI is quite slow (in comparison to a SQL connection), the DB is not an issue.
Just make it work. And then make it better.
The good thing about working with an open connection here is that the "natural" solution is to use objects and not just functions. So it will be a learning experience (in case you are learning Python and not MySQL).
Think for a moment about the following scenario:
for dataItem in dataSet:
    update(dataItem)
If you open and close your connection inside of the update function and your dataSet contains a thousand items then you will destroy the performance of your application and ruin any transactional capabilities.
A better way would be to open a connection and pass it to the update function. You could even have your update function call a connection manager of sorts. If you intend to perform single updates periodically then open and close your connection around your update function calls.
In this way you will be able to use functions to encapsulate your data operations and be able to share a connection between them.
However, this approach is not great for performing bulk inserts or updates.
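A hedged sketch of that shared-connection pattern, assuming PyMySQL as the driver; the table, columns and data_set are placeholders:
import pymysql   # assumption: any DB-API MySQL driver works the same way

def update(conn, data_item):
    # Hypothetical update helper: it uses the connection it is given
    # instead of opening and closing its own.
    with conn.cursor() as cur:
        cur.execute("UPDATE inventory SET value = %s WHERE host = %s",
                    (data_item["value"], data_item["host"]))

conn = pymysql.connect(host="localhost", user="app",
                       password="secret", database="wmi")    # placeholders
try:
    for data_item in data_set:           # data_set is a placeholder iterable
        update(conn, data_item)
    conn.commit()                        # one commit for the whole batch
finally:
    conn.close()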
Useful clues in S.Lott's and Igal Serban's answers. I think you should first find out your actual requirements and code accordingly.
Just to mention a different strategy: some applications keep a pool of database (or whatever) connections and, when a transaction is needed, just pull one from that pool. It seems rather obvious that you only need one connection for this kind of application. But you can still keep a pool of one connection and apply the following (sketched below):
Whenever a database transaction is needed, the connection is pulled from the pool and returned at the end.
(optional) The connection is expired (and replaced by a new one) after a certain amount of time.
(optional) The connection is expired after a certain amount of usage.
(optional) The pool can check (by sending an inexpensive query) whether the connection is alive before handing it over to the program.
This is somewhat in between single connection and connection per transaction strategies.
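A hedged sketch of such a "pool of one", assuming PyMySQL (or any driver with a ping() method); the expiry interval and connection keyword arguments are placeholders:
import time
import pymysql   # assumption: any driver exposing ping() works here

class OneConnectionPool:
    # Minimal sketch of a pool holding a single connection, with age-based
    # expiry and a cheap liveness check before handing it out.

    def __init__(self, max_age_seconds=300, **connect_kwargs):
        self._connect_kwargs = connect_kwargs
        self._max_age = max_age_seconds
        self._conn = None
        self._created = 0.0

    def _fresh(self):
        self._conn = pymysql.connect(**self._connect_kwargs)
        self._created = time.monotonic()
        return self._conn

    def acquire(self):
        if self._conn is None:
            return self._fresh()
        if time.monotonic() - self._created > self._max_age:
            self._conn.close()
            return self._fresh()                  # expired by age: replace it
        try:
            self._conn.ping(reconnect=False)      # inexpensive liveness check
        except Exception:
            return self._fresh()                  # dead connection: replace it
        return self._conn

    def release(self, conn):
        # With a pool of one there is nothing to do; a real pool would put
        # the connection back on a queue here.
        pass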