speed improvement in postgres INSERT command - python

I am writing a program to load data into a particular database. This is what I am doing right now ...
import psycopg2

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

lRows = len(rows)
i, iN = 0, 1000
while True:
    if iN >= lRows:
        # write the last of the data, and break ...
        iN = lRows
        values = [dict(zip(header, r)) for r in rows[i:iN]]
        cur.executemany(insertString, values)
        conn.commit()
        break
    values = [dict(zip(header, r)) for r in rows[i:iN]]
    cur.executemany(insertString, values)
    conn.commit()
    i += 1000
    iN += 1000
cur.close()
conn.close()
I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into the database, which is why I am using Python in this manner.
I have a couple of questions on how to make things faster ...
1. Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? This would mean putting a single conn.commit() statement just before the cur.close() statement.
2. I have always seen other people use cur.executemany() for batches of about 1000 records. Is this generally the case, or is it possible to just do a cur.executemany() on the entire set of records that I read from the file? I would potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit on the number of records that I can upload at any one time?
3. I am making a fresh connection to the database for every file that I open. I am doing this because the process takes many days to complete and I don't want a connection problem to corrupt the data if the connection is lost at any point. I have over a thousand files that I need to go through. Are those thousand connections going to be a significant part of the total time used for the process?
4. Are there any other things that I am doing in the program that I shouldn't be, that could shorten the total time for the process?
Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason I don't seem to have a definitive answer to any of these questions right now.

As you mentioned in point 3, you are worried about the database connection breaking, so if you use a single conn.commit() only after all inserts, you can easily lose already inserted but not yet committed data if your connection breaks before that conn.commit(). If you do conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you, and depends on the workflow you need to support.
The number of records per batch is a trade-off between insertion speed and other things. You need to choose a value that satisfies your requirements; you can test your script with 1000 records per batch and with 10000 per batch and check the difference.
The case of inserting the whole file within one cur.executemany() has the advantage of atomicity: if it has been executed, that means all records from this particular file have been inserted, so we're back to point 1.
I think the cost of establishing a new connection in your case does not really matter. Say it takes one second to establish a new connection; with 1000 files that is 1000 seconds spent on connecting, spread over days.
The program itself looks fine, but I would still recommend that you take a look at the COPY command (COPY ... FROM) with UNLOGGED or TEMPORARY tables; it will really speed up your imports.
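If the bookkeeping has to stay in Python, a middle ground between executemany() and raw COPY is psycopg2.extras.execute_values (available in newer psycopg2 releases), which sends each batch as a single multi-row INSERT. A minimal sketch, assuming the same rows list as in the question; the table and column names are placeholders and each row is assumed to be a tuple:

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost'")
    cur = conn.cursor()

    # One multi-row INSERT per batch; the %s placeholder is expanded by execute_values.
    insert_sql = "INSERT INTO my_table (col_a, col_b, col_c) VALUES %s"

    batch_size = 1000
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]            # each row is a tuple of values
        execute_values(cur, insert_sql, batch, page_size=batch_size)
        conn.commit()                                     # commit per batch, as discussed above

    cur.close()
    conn.close()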

Related

Psycopg2 - Insert large data leads to server closed unexpectedly

I've looked at other similar problems, but they did not quite give me the answer I was looking for. My main goal is to store a large dataset in Google Cloud. Storing around 1,000 rows' worth of insert values went well.
On the other hand, storing over 200,000 proved to be more of a challenge than I thought. In my code, I have this function:
import psycopg2

def insert_to_gcloud(information):
    db = psycopg2.connect(database="database",
                          user="user",
                          password="password",
                          host="host",
                          port=1234)
    cur = db.cursor()
    cur.execute("".join(information))  # `information` is a list: INSERT header + value strings (see below)
    db.commit()
    db.close()
I use a batch insert (supposed to be faster), where the first element of the list contains the insert statement and the rest are the values. Then I use "".join() to turn it into a single string. A simple example:
INSERT INTO tbl_name (a,b,c) VALUES (1,2,3),(4,5,6),(7,8,9);

["INSERT INTO tbl_name (a,b,c) VALUES",
 "(1,2,3),",
 "(4,5,6),",
 "(7,8,9);"]
At first, I tried to execute the 200,000-row insert statement in one go, but I got an error about EOF after about a minute. I guess it was too big to send, so I made a chunk function. Basically, it divides the array into pieces of whatever size I want. (reference)
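For reference, a minimal sketch of what such a chunk function could look like, assuming the list format shown above (first element is the INSERT header, the rest are the value strings); the function name and the fix-up of the trailing separator are illustrative:

    def chunk(information, size):
        """Split one big batch into batches of at most `size` value rows,
        each carrying its own copy of the INSERT header."""
        header, values = information[0], information[1:]
        for start in range(0, len(values), size):
            part = values[start:start + size]
            part[-1] = part[-1].rstrip(",;") + ";"   # terminate the last row of this batch
            yield [header] + part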
Then I used a simple for-loop to store one block at a time:

    for i in range(len(chunk_block)):
        insert_to_gcloud(chunk_block[i])
It seemed to be working, but when I let it run overnight it took 237 minutes and then I got this message:
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
My next test is to store the chunk blocks into files, then read and log the files; if the server gets closed unexpectedly again, I can check the log. Though this is not a good way to do it in my opinion, I'm out of ideas here.
So, to my question: is there any option I could try? Maybe there are other tools I can use? 🙂

Python script stops, no errors given

I have a Python script that needs to be running all the time. Sometimes it runs for a whole day, sometimes it only runs for an hour or so.
import time

import RPi.GPIO as GPIO
import fdb

con = fdb.connect(dsn='10.100.2.213/3050:/home/trainee2/Desktop/sms',
                  user='sysdba', password='trainee')  # connect to database
cur = con.cursor()  # initialize cursor

pinnen = [21, 20, 25, 24, 23, 18, 26, 19, 13, 6, 27, 17]  # the GPIO pins we use; they are the same on all PIs and we need them in this sequence
status = [0] * 12       # empty array where we'll save the status of each pin
ids = []
controlepin = [2] * 12  # same as the status array, only one step behind, so we know where a difference was made and can send it

def getPersonIDs():  # here we get the IDs
    cur.execute("SELECT first 12 A.F_US_ID FROM T_RACK_SLOTS a order by F_RS_ID;")  # this is the line that changes for each PI
    for (ID,) in cur:
        ids.append(ID)  # append all the ids to the array

GPIO.setmode(GPIO.BCM)  # initialize GPIO
getPersonIDs()  # get the ids we need
for p in range(0, 12):
    GPIO.setup(pinnen[p], GPIO.IN)  # set up all the pins to read out data

while True:  # this will repeat endlessly
    for e in range(0, 12):
        if e < len(ids) and ids[e]:  # if there is a value in the ids (only necessary on PI 3, where there are not enough users to fill all slots)
            status[e] = GPIO.input(pinnen[e])  # get the status of the GPIO: 0 is dark, 1 is light
            if status[e] != controlepin[e]:  # if there is a change
                id = ids[e]
                if id != '':  # if the id is not empty
                    if status[e] == 1:  # if there is no cell phone present
                        cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 0)", (id,))  # careful! status 0 sends 1, status 1 sends 0 so it makes sense in the database
                    else:
                        cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 1)", (id,))
                    con.commit()  # commit the query
                controlepin[e] = status[e]  # save the change so we don't spam the database
    time.sleep(1)  # sleep for one second so the loop doesn't hammer the GPIO and the database
The script is used for a cell phone rack: through LDRs I can see whether a cell phone is present, and I send that data to a Firebird database. The scripts run on my Raspberry Pis.
Can it be that the script just stops if the connection is lost for a few seconds? Is there a way to make sure the queries are always sent?
Can it be that the script just stops if the connection is lost for a few seconds?
More than that: the script IS stopping at every Firebird command, including con.commit(), and it only continues when Firebird has processed the command/query.
So, not knowing much about Python libraries, I would still give you some advice.
1) Use parameters and prepared queries as much as you can.
if status[e] == 1:  # if there is no cell phone present
    cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 0)", (id,))  # careful! status 0 sends 1, status 1 sends 0 so it makes sense in the database
else:
    cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 1)", (id,))
That is not the best idea. You force the Firebird engine to parse the query text and build the query again and again, a waste of time and resources.
The correct approach is to make a single INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) VALUES (?, ?) query, prepare it once before the loop, and then run the already prepared query many times, changing only the parameters.
Granted, I do not know how to prepare queries in the Python library, but I think you'll find examples.
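For what it's worth, fdb does expose explicit preparation through Cursor.prep(); a minimal sketch of the idea, assuming the table, cursor and loop variables from the question:

    # Prepare once, before the loop; fdb's Cursor.prep() returns a prepared-statement object
    insert_entry = cur.prep("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) VALUES (?, ?)")

    # Inside the loop, execute the already prepared statement with fresh parameters
    # (the state is inverted on purpose, as in the original code).
    cur.execute(insert_entry, (id, 0 if status[e] == 1 else 1))
    con.commit()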
2) Do not use an SQL server to save every single data element as you get it. That is a known malpractice, already being warned about a decade ago, especially with a lazy, versioned engine like InterBase/Firebird.
The thing is, with every statement of yours Firebird checks some internal statistics and sometimes decides it is time to do housekeeping.
For example, your select statement can trigger garbage collection: Firebird might stop to scan the whole table, find the orphaned obsolete row versions and clear them away. Your insert statement can likewise trigger index recreation: if Firebird thinks the B-tree of an index has become too one-sided, it drops it and builds a new, balanced tree, reading the whole table (and yes, reading the table may provoke GC on top of the tree recreation).
Moreover, to steer away from Firebird specifics: what would you do if Firebird crashes? It is a program, and like every program it may have bugs. Or, for example, you run out of disk space and Firebird can no longer insert anything into the database - where would your hardware sensor data end up then? Wouldn't it just be lost?
Some of the pages linked below are in Russian; http://www.translate.ru usually handles them better than Google or Microsoft translation, especially if you set the vocabulary to computers.
See #7 at http://www.ibase.ru/dontdoit/ - "Do not issue commit after every single row". #24 at https://www.ibase.ru/45-ways-to-improve-firebird-performance-russian/ suggests committing packets of about a thousand rows as a sweet spot between too many transactions and too much uncommitted data. Also check #9, #10, #16, #17 and #44 at the last link.
I believe the overall structure of your software should be split into two services:
1) Query data from the hardware sensors and save it to a plain, dumb binary flat file. Since this file is the most simplistic format there can be, performance and reliability are maximized.
2) Take the completed binary files and insert them into the SQL database in bulk-insert mode.
So, for example, you set a threshold of 10240 rows.
Service #1 creates the file "Phones_1_Processing" in a well-defined BINARY format. It also creates and opens a "Phones_2_Processing" file, but keeps it at zero length. Then it keeps adding rows to "Phones_1_Processing" for a while. It might also flush the OS file buffers after every row, or every 100 rows, or whatever gives the best balance between reliability and performance.
When the threshold is met, service #1 switches to recording incoming data cells into the already created and opened "Phones_2_Processing" file. This can be done instantly, by changing one file-handle variable in your program.
Then service #1 closes and renames "Phones_1_Processing" to "Phones_1_Complete".
Then service #1 creates a new empty file, "Phones_3_Processing", and keeps it open at zero length. It is now back in state 1 - ready to instantly switch its recording to a new file when the current one is full.
The key point here is that the service should only do the simplest and fastest operations, since any random delay means your realtime-generated data is lost and will never be recovered. By the way, how do you disable garbage collection in Python, so it does not "stop the world" suddenly? Okay, half-joking, but the point stands: GC is a random, non-deterministic slowdown of your system, and it is badly compatible with regular, non-buffered hardware data acquisition. That primary acquisition of non-buffered data is better done by the most simple and predictable service; GC is a good global optimization, but the price is that it tends to generate sudden local no-service spikes.
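A minimal sketch of that rotation logic in Python, assuming a fixed binary record layout; the struct format, file names, threshold and the read_sensor() callback are all illustrative, and the "keep the next file pre-opened" refinement described above is omitted for brevity:

    import os
    import struct
    import time

    RECORD = struct.Struct("<dii")   # timestamp, slot number, state (illustrative layout)
    THRESHOLD = 10240                # rows per file, as in the example above

    def acquisition_service(read_sensor):
        seq = 1
        out = open("Phones_%d_Processing" % seq, "wb")
        rows = 0
        while True:
            slot, state = read_sensor()                        # blocking read of one sample
            out.write(RECORD.pack(time.time(), slot, state))
            out.flush()                                        # or only every N rows, for speed
            rows += 1
            if rows >= THRESHOLD:                              # rotate: hand the full file over
                out.close()
                os.rename("Phones_%d_Processing" % seq,
                          "Phones_%d_Complete" % seq)
                seq += 1
                out = open("Phones_%d_Processing" % seq, "wb")
                rows = 0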
While all this is happening in service #1, we have another one.
Service #2 keeps monitoring the folder you use to save the primary data. It subscribes to "some file was renamed" events and ignores the rest. Which facility to use? Ask a Linux guru: inotify, dnotify, FAM, Gamin, anything of that kind that suits.
When service #2 is woken up with "a file was renamed and xxxx is its new name", it checks whether the new file name ends with "_Complete". If it does not, it was a bogus event.
When the event is for a new "Phone_...._Complete" file, it is time to "bulk insert" it into Firebird. Google for "Firebird bulk insert", for example http://www.firebirdfaq.org/faq209/
Service #2 renames "Phone_1_Complete" to "Phone_1_Inserting", so the state of the data packet is persisted (as the file name).
Service #2 attaches this file to the Firebird database as an EXTERNAL TABLE.
Service #2 proceeds with the bulk insert as described above: deactivating indexes, it opens a no-auto-undo transaction and keeps pumping the rows from the external table into the destination table. If the service or the server crashes here, you have a consistent state: the transaction gets rolled back, and the file name shows that it is still pending insertion.
When all the rows are pumped - frankly, if Python can work with binary files, it is a single INSERT-FROM-SELECT - you commit the transaction, drop the external table (detaching Firebird from the file), rename "Phone_1_Inserting" to "Phone_1_Done" (persisting its changed state), and then delete the file.
Then service #2 checks whether there are new "_Complete" files already waiting in the folder; if not, it goes back to step 1 and sleeps until a file-notification event wakes it.
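A rough sketch of service #2's state handling, using simple polling instead of inotify/FAM and leaving the actual Firebird bulk insert as a stub (it depends on the external-table setup described above); all names here are illustrative:

    import glob
    import os
    import time

    def bulk_insert_into_firebird(path):
        # Stub: attach `path` as an EXTERNAL TABLE and run the INSERT-FROM-SELECT
        # inside one no-auto-undo transaction, as described above.
        raise NotImplementedError

    def consumer_service():
        while True:
            complete = glob.glob("Phones_*_Complete")
            if not complete:
                time.sleep(5)                              # in real life: wait for an inotify/FAM event
                continue
            for done_file in sorted(complete):
                inserting = done_file.replace("_Complete", "_Inserting")
                os.rename(done_file, inserting)            # persist the new state in the file name
                bulk_insert_into_firebird(inserting)
                finished = inserting.replace("_Inserting", "_Done")
                os.rename(inserting, finished)             # persist the "done" state, then clean up
                os.remove(finished)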
All in all, you should DECOUPLE your services.
https://en.wikipedia.org/wiki/Coupling_%28computer_programming%29
One service whose main responsibility is to be ready, without even a tiny pause, to receive and save the data flow; and another service whose responsibility is to transfer the saved data into the SQL database for ease of processing, and for which it is not a big issue if it sometimes lags by a few seconds, as long as it does not lose data in the end.

Python/Hive interface slow with fetchone(), hangs with fetchall()

I have a Python script that queries HiveServer2 using pyhs2, like so:
import pyhs2

conn = pyhs2.connect(host='localhost',
                     port=10000,
                     user='user',
                     password='password',
                     database='default')
cur = conn.cursor()
cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC")

line = cur.fetchone()
while line is not None:
    # <do some processing, including writing to stdout>
    ...
    line = cur.fetchone()
I have also tried using fetchall() instead of fetchone(), but that just seems to hang forever.
My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat, tab-delimited file and wrote the guts of my Python script against that, so I didn't have to wait for the query to finish every time I ran it. The script that reads the flat file finishes in ~20 minutes. What confuses me is that I don't see the same performance when I query Hive directly; in fact, it takes about 5 times longer to finish processing. I am pretty new to Hive and Python, so maybe I am making some bone-headed error, but the examples I see online show a setup like this. I just want to iterate through my Hive results, getting one row at a time as quickly as possible, much like I did with my flat file. Any suggestions?
P.S. I have found this question that sounds similar:
Python slow on fetchone, hangs on fetchall
but that ended up being a SQLite issue, and I have no control over my Hive setup.
Have you considered using fetchmany()?
That is the DB-API way of pulling data in chunks (bigger than one row, where the per-row overhead is the issue, and smaller than all rows, where memory is the issue).
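Applied to the loop in the question, that could look roughly like this; the batch size is just a starting point to tune, process() is a placeholder for the per-row work, and the assumption is that pyhs2's cursor honours the standard DB-API size argument of fetchmany():

    cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC")
    while True:
        batch = cur.fetchmany(10000)     # pull 10k rows per round trip instead of one
        if not batch:
            break
        for line in batch:
            process(line)                # placeholder for the processing / stdout writing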

psycopg2 leaking memory after large query

I'm running a large query in a Python script against my Postgres database using psycopg2 (I upgraded to version 2.5). After the query finishes, I close the cursor and the connection, and even run gc, but the process still consumes a ton of memory (7.3 GB, to be exact). Am I missing a cleanup step?
import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()
I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.
Instead of
cursor = conn.cursor()
write
cursor = conn.cursor(name="my_cursor_name")
or simpler yet
cursor = conn.cursor("my_cursor_name")
The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors
I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include
"DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the name=None default parameter when creating a cursor.
The suggestion in another answer of using fetchone or fetchmany doesn't resolve the problem on its own, since, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides declaring a name parameter) is to change the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.
import datetime as dt
import psycopg2
import sys
import time

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000  # rows fetched at one time from the server
curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!

cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # to see the progression
conPG.commit()  # also closes the cursor
conPG.close()
As you will see, the dots come in rapid groups, then pause while a new buffer of rows (itersize) is fetched, so you don't need fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.
Please see the answer by joeblog (the server-side cursor) for the better solution.
First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
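A minimal sketch combining this suggestion with the server-side (named) cursor from the other answers, so only one chunk lives in client memory at a time; the cursor name, chunk size and process() are illustrative:

    cursor = conn.cursor("large_query_cursor")   # named => server-side cursor
    cursor.execute("""large query""")
    while True:
        chunk = cursor.fetchmany(10000)          # fetch 10k rows per call instead of everything
        if not chunk:
            break
        for row in chunk:
            process(row)                         # placeholder for the per-row work
    cursor.close()
    conn.close()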
Now, the explanation for why it isn't freed, and why that isn't a memory leak in the formally correct use of that term.
Most processes don't release memory back to the OS when it's freed; they just make it available for re-use elsewhere in the program.
Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.
What usually ends up happening, unless extra caution is exercised by the program, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.
The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.
It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
See also:
python - memory not being given back to kernel
Why doesn't memory get released to system after large queries (or series of queries) in django?
Releasing memory in Python
So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it will re-use the previously freed memory from the last big allocation.

Performance lost when opening a db multiple times in BerkeleyDB

I'm using BerkeleyDB to develop a small app, and I have a question about opening a database multiple times in BDB.
I have a large set of text (a corpus), and I want to load part of it to do a calculation. I have two pieces of pseudo-code (mixed with Python) here:
#1
def getCorpus(token):
    DB.open()
    DB.get(token)
    DB.close()

#2
# open and wait
def openCorpus():
    DB.open()

# close database
def closeCorpus():
    DB.close()

def getCorpus(token):
    DB.get(token)
In the second example, I open the db before the calculation, load a token on each loop iteration, and then close the db at the end.
In the first example, each time the loop asks for a token, I open, get and then close the db.
Is there any performance loss?
I should also note that I'm using a DBEnv to manage the database.
If you aren't caching the opened database, you will always lose performance because:
you call open() and close() multiple times, which are quite expensive, and
you lose all potential buffers (both system buffers and BDB internal buffers).
But I wouldn't care too much about the performance before the code is written.
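A minimal sketch of the second approach with the bsddb3 bindings; the pseudo-code above hides the real API, so the environment path, flags, database name and type here are illustrative assumptions:

    import bsddb3.db as db

    env = db.DBEnv()
    env.open('/path/to/env', db.DB_CREATE | db.DB_INIT_MPOOL)   # shared cache for all handles

    corpus = db.DB(env)
    corpus.open('corpus.db', None, db.DB_HASH, db.DB_CREATE)    # open once, before the loop

    def get_corpus(token):
        return corpus.get(token)     # every lookup reuses the open handle and its cache

    # ... run the whole calculation, calling get_corpus() as often as needed ...

    corpus.close()
    env.close()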
