Performance loss when opening a db multiple times in BerkeleyDB - python

I'm using BerkeleyDB to develop a small app, and I have a question about opening a database multiple times in BDB.
I have a large set of text (a corpus), and I want to load a part of it to do some calculations. Here are two pieces of pseudo-code (mixed with Python):
#1
def getCorpus(token):
    DB.open()
    DB.get(token)
    DB.close()

#2
# open and wait
def openCorpus():
    DB.open()

# close database
def closeCorpus():
    DB.close()

def getCorpus(token):
    DB.get(token)
In the second example, I open the db before the calculation, load a token on each loop iteration, and then close the db at the end.
In the first example, each time the loop asks for a token, I open, get and then close the db.
Is there any performance loss?
I also note that I'm using a DBEnv to manage the database.

If you aren't caching the opened file you will always lose performance, because:
you call open() and close() multiple times which are quite expensive,
you lose all potential buffers (both system buffers and bdb internal buffers).
But I wouldn't care too much about the performance before the code is written.
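If you do want to keep the second pattern, here is a minimal sketch of it using the bsddb3 bindings; the module, environment directory and file names are assumptions for illustration, not taken from the question. Open the environment and the database once, run all the lookups against the same handle, and close at the end.
from bsddb3 import db  # standalone "bsddb3" package; adjust to whatever bindings you use

# Open the environment and the database once, up front.
env = db.DBEnv()
env.open('/path/to/env', db.DB_CREATE | db.DB_INIT_MPOOL)
corpus = db.DB(env)
corpus.open('corpus.db', None, db.DB_HASH, db.DB_RDONLY)

def getCorpus(token):
    # BDB keys are raw bytes.
    return corpus.get(token.encode('utf-8'))

# ... run the whole calculation loop here, reusing the same handle ...

corpus.close()
env.close()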

Related

pyodbc connection.close() very slow with Access Database

I am using pyodbc to open a connection to a Microsoft Access database file (.accdb) and run a SELECT query to create a pandas dataframe. I acquired the ODBC drivers from here. I have no issues with retrieving and manipulating the data but closing the connection at the end of my code can take 10-15 seconds which is irritating. Omitting conn.close() fixes my problem and prior research indicates that it's not critical to close my connection as I am the only one accessing the file. However, my concern is that I may have unexpected hangups down the road when I integrate the code into my tkinter gui.
import pyodbc
import pandas as pd
import time
start = time.time()
db_fpath = r'C:\mydb.accdb'  # raw string so the backslash is not treated as an escape
conn = pyodbc.connect(r'Driver={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={0};'.format(db_fpath))
pd.read_sql_query('SELECT * FROM table',conn)
conn.close()
print(time.time()-start)
I get the following results:
15.27361798286438 # with conn.close()
0.4076552391052246 # without conn.close()
If I wrap the code in a function but omit the call to conn.close(), I also encounter a hangup which makes me believe that whenever I release the connection from memory, it will cause a slowdown.
If I omit the SQL query (open the connection and then close it without doing anything), there is no hangup.
Can anyone duplicate my issue?
Is there something I can change about my connection or method of closing to avoid the slowdown?
EDIT: After further investigation, Dropbox was not the issue. I am actually using pd.read_sql_query() 3 times in succession using conn to import 3 different tables from my database. Trying different combinations of closing/not closing the connection and reading different tables (and restarting the kernel between tests), I determined that only when I read one specific table can I cause the connection closing to take significantly longer. Without understanding the intricacies of the ODBC driver or what's different about that table, I'm not sure I can do anything more. I would upload my database for others to try but it contains sensitive information. I also tried switching to pypyodbc to no effect. I think the original source of the table was actually a much older .mdb Access Database so maybe remaking that table from scratch would solve my issue.
At this point, I think the simplest solution is just to maintain the connection object in memory always to avoid closing it. My initial testing indicates this will work out although it is a bit of a pain.
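A minimal sketch of that workaround, assuming a module-level connection that is opened lazily and then reused for every query (the helper name and file path are placeholders):
import pyodbc

_conn = None

def get_conn(db_fpath=r'C:\mydb.accdb'):
    # Open the connection once and keep reusing it, so the slow close() never runs.
    global _conn
    if _conn is None:
        _conn = pyodbc.connect(
            r'Driver={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={0};'.format(db_fpath))
    return _conn

# e.g. df = pd.read_sql_query('SELECT * FROM table', get_conn())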

speed improvement in postgres INSERT command

I am writing a program to load data into a particular database. This is what I am doing right now ...
import psycopg2

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()
lRows = len(rows)
i, iN = 0, 1000
while True:
    if iN >= lRows:
        # write the last of the data, and break ...
        iN = lRows
        values = [dict(zip(header, r)) for r in rows[i:iN]]
        cur.executemany(insertString, values)
        conn.commit()
        break
    values = [dict(zip(header, r)) for r in rows[i:iN]]
    cur.executemany(insertString, values)
    conn.commit()
    i += 1000
    iN += 1000
cur.close()
conn.close()
I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into a database. Hence I am using Python in this manner.
I have a couple of questions on how to make things faster ...
Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? This means that I will put a single conn.commit() statement just before the cur.close() statement.
I have always seen other people use cur.executemany() for batches of about 1000 records. Is this generally the case, or is it possible to just do a cur.executemany() on the entire set of records that I read from the file? I would potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit of the number of records that I can upload at any one time?
I am making a fresh connection to the database for every file that I open. I do this because the process takes many days to complete and I don't want a lost connection to corrupt the data that has already been loaded. I have over a thousand files to go through. Are these thousand connections going to be a significant part of the total time of the process?
Are there any other things that I am doing in the program that I shouldn't be doing that can shorten the total time for the process?
Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason, I don't seem to have any definitive answer to any of these questions right now.
As you mentioned in point 3, you are worried about the database connection breaking, so if you use a single conn.commit() only after all the inserts, you can easily lose data that was already inserted but not committed if your connection breaks before that conn.commit(). If you do conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you and depends on the workflow you need to support.
The number of records per batch is a trade-off between insertion speed and other things. You need to choose a value that satisfies your requirements; you can test your script with 1000 records per batch and with 10000 per batch and compare the difference.
Inserting the whole file with one cur.executemany() has the advantage of atomicity: if it has been executed, all records from that particular file have been inserted, so we're back to point 1.
I think the cost of establishing a new connection does not really matter in your case. Say it takes one second to establish a new connection; with 1000 files that is 1000 seconds spent on connections over a process that takes days.
The program itself looks fine, but I would still recommend taking a look at the COPY command together with UNLOGGED or TEMPORARY tables; it will really speed up your imports.
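For illustration, a hedged sketch of the COPY route using psycopg2's copy_from with an in-memory buffer; the table and column names are made up, and it assumes rows is the list of already-cleaned tuples from your bookkeeping step:
import io
import psycopg2

conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost'")
cur = conn.cursor()

# Build a tab-separated, in-memory "file" from the rows you have already processed.
buf = io.StringIO()
for r in rows:
    buf.write('\t'.join(str(v) for v in r) + '\n')
buf.seek(0)

# COPY is far faster than many INSERTs; the column list must match the order in rows.
cur.copy_from(buf, 'my_table', sep='\t', columns=('col_a', 'col_b', 'col_c'))
conn.commit()

cur.close()
conn.close()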

Python/Hive interface slow with fetchone(), hangs with fetchall()

I have a python script that is querying HiveServer2 using pyhs2, like so:
import pyhs2

conn = pyhs2.connect(host='localhost',
                     port=10000,
                     user='user',
                     password='password',
                     database='default')
cur = conn.cursor()
cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC")
line = cur.fetchone()
while line is not None:
    # <do some processing, including writing to stdout>
    # ...
    line = cur.fetchone()
I have also tried using fetchall() instead of fetchone(), but that just seems to hang forever.
My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat, tab-delimited file and wrote the guts of my Python script against that, so I didn't have to wait for the query to finish every time I ran it. The script that reads the flat file finishes in ~20 minutes. What confuses me is that I don't see the same performance when I query Hive directly. In fact, it takes about 5 times longer to finish processing. I am pretty new to Hive and Python, so maybe I am making some bone-headed error, but the examples I see online show a setup like this. I just want to iterate through my Hive result set, getting one row at a time as quickly as possible, much like I did with my flat file. Any suggestions?
P.S. I have found this question that sounds similar:
Python slow on fetchone, hangs on fetchall
but that ended up being a SQLite issue, and I have no control over my Hive set up.
Have you considered using fetchmany()?
That is the DB-API way of pulling data in chunks (bigger than one row, where per-row overhead is an issue, and smaller than all rows, where memory is an issue).
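A minimal sketch of that pattern, assuming the pyhs2 cursor implements the DB-API fetchmany (the batch size is just an example to tune):
BATCH_SIZE = 10000  # larger batches mean fewer round trips, but more memory per batch

cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC")
while True:
    rows = cur.fetchmany(BATCH_SIZE)
    if not rows:
        break
    for line in rows:
        # process one row at a time, as in the original loop
        pass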

psycopg2 leaking memory after large query

I'm running a large query in a python script against my postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, and even run gc, but the process still consumes a ton of memory (7.3gb to be exact). Am I missing a cleanup step?
import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()
I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.
Instead of
cursor = conn.cursor()
write
cursor = conn.cursor(name="my_cursor_name")
or simpler yet
cursor = conn.cursor("my_cursor_name")
The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors
I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include
"DECLARE my_cursor_name ..." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.
The suggestion above of using fetchone or fetchmany doesn't resolve the problem on its own because, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides declaring a name parameter) is lower the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.
import datetime as dt
import psycopg2
import sys
import time

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000  # Rows fetched at one time from the server
curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!

cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # To see the progression

conPG.commit()  # Also closes the cursor
conPG.close()
As you will see, the dots come in rapid groups, then pause while a buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200MB of RAM (instead of 60GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.
Please see the answer by @joeblog above (using a named server-side cursor) for the better solution.
First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
Now, the explanation for why it isn't freed, and why that isn't a memory leak in the formally correct use of that term.
Most processes don't release memory back to the OS when it's freed, they just make it available for re-use elsewhere in the program.
Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.
What usually ends up happening, unless the program takes extra care, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.
The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.
It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
See also:
python - memory not being given back to kernel
Why doesn't memory get released to system after large queries (or series of queries) in django?
Releasing memory in Python
So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it'll re-use the previously freed memory from the last big allocation.
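As a rough sketch of the chunked approach (combined with a named, server-side cursor as the other answers describe, so the full result set never lands in client RAM; the query and names are placeholders):
import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cur = conn.cursor(name='chunked_cursor')  # a name makes this a server-side cursor

cur.execute("SELECT * FROM some_large_table")
while True:
    chunk = cur.fetchmany(5000)  # only this many rows are held in client memory at once
    if not chunk:
        break
    for row in chunk:
        pass  # process one row at a time

cur.close()
conn.close()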

Why doesn't this loop display an updated object count every five seconds?

I use this python code to output the number of Things every 5 seconds:
def my_count():
    while True:
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)

my_count()
If another process generates a new Thing while my_count() is running, my_count() will keep printing the same number, even though it now has changed in the database. (But if I kill my_count() and restart it, it will display the new Thing count.)
Things are stored in a MySQL InnoDB database, and this code runs on Ubuntu.
Why won't my_count() display the new Thing.objects.count() without being restarted?
Because the Python DB API is by default in AUTOCOMMIT=OFF mode, and (at least for MySQLdb) runs at the REPEATABLE READ isolation level. This means that behind the scenes you have an ongoing database transaction (InnoDB is a transactional engine) in which the first access to a given row (or maybe even table, I'm not sure) fixes the "view" of this resource for the remaining part of the transaction.
To prevent this behaviour, you have to 'refresh' the current transaction:
from django.db import transaction

@transaction.autocommit
def my_count():
    while True:
        transaction.commit()
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)
-- note that the transaction.autocommit decorator is only for entering transaction management mode (this could also be done manually using the transaction.enter_transaction_management/leave_transaction_management functions, as sketched below).
One more thing to be aware of: Django's autocommit is not the same autocommit you have in the database; it's completely independent. But that is out of scope for this question.
Edited on 22/01/2012
Here is a "twin answer" to a similar question.
