I've looked at other similar questions, but they didn't quite give me the answer I was looking for. My main goal is to store a large dataset in Google Cloud. I tried storing around 1,000 insert statements to Google Cloud and it went well.
On the other hand, storing over 200,000 insert statements proved to be more of a challenge than I thought. In my code, I have this function:
import psycopg2

def insert_to_gcloud(information):
    # Open a connection, run the batched INSERT, commit, and close.
    db = psycopg2.connect(database="database",
                          user="user",
                          password="password",
                          host="host",
                          port=1234)
    cur = db.cursor()
    cur.execute("".join(information))
    db.commit()
    db.close()
I use a batched insert (supposed to be faster), where the first element contains the insert statement and the rest contain the values. Then I use "".join() to turn it into a single string. A simple example:
INSERT INTO tbl_name (a,b,c) VALUES (1,2,3),(4,5,6),(7,8,9);
["INSERT INTO tbl_name (a,b,c) VALUES",
"(1,2,3)," ,
"(4,5,6)," ,
"(7,8,9);" ]
At first, I tried to execute the 200,000 insert statements at once, but I got an error about EOF after about a minute. I guess the statement was too big to send, so I made a chunk function. Basically, it divides the array into chunks of the size I want. (reference)
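For reference, a minimal sketch of such a chunk function (the function name and chunk size are my own placeholders, not the original code):

def chunk(statement, values, size=1000):
    # Split the value strings into blocks of `size`, each prefixed with the
    # INSERT statement, so every block is a complete SQL command.
    blocks = []
    for i in range(0, len(values), size):
        part = values[i:i + size]
        part[-1] = part[-1].rstrip(",;") + ";"  # make sure the block ends with ';'
        blocks.append([statement] + part)
    return blocks

# e.g. chunk_block = chunk(data[0], data[1:]) where data is the full list built above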
Then I used a simple for-loop to store one block at a time:

for i in range(len(chunk_block)):
    insert_to_gcloud(chunk_block[i])
It seemed to be working, but when I let it run overnight it stopped after 237 minutes with this message:
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
My next test is to store the chunk blocks into files, then read and log each file as it is inserted. If the server closes the connection unexpectedly again, I can check the log. Though this is not a good way to do it in my opinion, I'm out of ideas here.
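A rough sketch of that logging idea (the file names and logging setup are placeholders I chose, not part of the original plan):

import logging

logging.basicConfig(filename="insert_progress.log", level=logging.INFO)

for idx, block in enumerate(chunk_block):
    # Keep a copy of each block on disk so a failed run can be inspected or resumed.
    with open("chunk_%05d.sql" % idx, "w") as f:
        f.write("".join(block))
    insert_to_gcloud(block)
    logging.info("chunk %d of %d inserted", idx + 1, len(chunk_block))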
So, to my question: is there any other option I could try? Maybe there are other tools I can use? 🙂
Related
I have a Python script which reads files into my tables in MySQL. This program runs automatically every now and then. However, I'm afraid of two things:
There might come a time when the program stops running because it can't connect to the MySQL server. There are a lot of processes depending on these tables, so if the tables are not up to date the rest of my processes will also stop working.
A file might sneak into the process which does not have the expected content. After the script finishes running, every value of column X must have 12 rows. If it does not have 12 rows, the files did not have the right content in them.
My question is: is there something I can do to tackle this before it happens? For example, send an e-mail to myself so I'm notified if the connection fails, or if a certain value does NOT have 12 rows, or run the program on another server?
I'm very eager to know how you handle these situations.
I have a very simple connection made like this:
import mysql.connector

mydb = mysql.connector.connect(
    host='localhost',
    user='root',
    passwd='*****.',
    database='my_database'
)
The event you are talking about is very unlikely to happen, and the only situation in which I could see it happening is when your database server runs out of memory. For that, you can set aside a couple of minutes every two to three days to check how much memory is left on your server.
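As for the e-mail part of the question, a minimal sketch of wrapping the connection attempt so a failure sends a notification (the SMTP server and addresses are placeholders, and this assumes a mail server is reachable from the machine running the script):

import smtplib
from email.message import EmailMessage

import mysql.connector

def notify(subject, body):
    # Placeholder notification helper; assumes an SMTP server on localhost.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "script@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

try:
    mydb = mysql.connector.connect(
        host='localhost',
        user='root',
        passwd='*****.',
        database='my_database'
    )
except mysql.connector.Error as exc:
    notify("MySQL connection failed", str(exc))
    raise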
I'm trying to call a stored procedure in my MSSQL database from a Python script, but it does not run completely when called via Python. This procedure consolidates transaction data into hourly/daily blocks in a single table, which is later grabbed by the Python script. If I run the procedure in SQL Server Management Studio, it completes just fine.
When I run it via my script, it gets cut short about two-thirds of the way through. As a workaround, I currently make the program sleep for 10 seconds before moving on to the next SQL statement, but this is neither time-efficient nor reliable, as some procedures may not finish in that time. I'm looking for a more elegant way to implement this.
Current Code:
cursor.execute("execute mySP")
time.sleep(10)
cursor.commit()
The most related article I can find to my issue is here:
make python wait for stored procedure to finish executing
I tried the solution using Tornado and I/O generators, but ran into the same issue listed in that article, which was never resolved. I also tried the accepted solution of having my stored procedure set a RunningStatus field in the database: at the beginning of the SP, Status is updated to 1 in RunningStatus, and when the SP finishes, Status is updated to 0. Then I implemented the following Python code:
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
cursor.commit()
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
When I implement this, the same problem happens as before, with my stored procedure appearing to finish before it is actually complete. If I eliminate the cursor.commit(), as follows, I end up with the connection just hanging indefinitely until I kill the Python process.
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
Any assistance in finding a more efficient and reliable way to implement this, as opposed to time.sleep(10), would be appreciated.
As the OP found out, inconsistent or incomplete processing of stored procedures from an application layer like Python may be due to straying from T-SQL scripting best practices.
As @AaronBertrand highlights in his Stored Procedures Best Practices Checklist blog post, consider the following among other items:
Explicitly and liberally use BEGIN ... END blocks;
Use SET NOCOUNT ON to avoid sending a message to the client for every row-affected action, which can interrupt the workflow;
Use semicolons for statement terminators.
Example
CREATE PROCEDURE dbo.myStoredProc
AS
BEGIN
    SET NOCOUNT ON;

    SELECT * FROM foo;
    SELECT * FROM bar;
END
GO
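On the Python side, a complementary step that often helps (my own suggestion, not part of the answer above) is to drain every result set the procedure produces before committing, for example with pyodbc's nextset():

import pyodbc

conn_str = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."  # placeholder
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

cursor.execute("EXECUTE mySP")
# Step through every result set / row count the procedure returns so the
# whole batch runs to completion before committing.
while cursor.nextset():
    pass
conn.commit()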
I am writing a program to load data into a particular database. This is what I am doing right now ...
import psycopg2

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

lRows = len(rows)
i, iN = 0, 1000
while True:
    if iN >= lRows:
        # write the last of the data, and break ...
        iN = lRows
        values = [dict(zip(header, r)) for r in rows[i:iN]]
        cur.executemany(insertString, values)
        conn.commit()
        break
    values = [dict(zip(header, r)) for r in rows[i:iN]]
    cur.executemany(insertString, values)
    conn.commit()
    i += 1000
    iN += 1000

cur.close()
conn.close()
I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into the database, hence I am using Python in this manner.
I have a couple of questions on how to make things faster ...
1. Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? In other words, I would put a single conn.commit() just before the cur.close() statement.
2. I have always seen other people use cur.executemany() for batches of around 1,000 records. Is this generally the case, or is it possible to just do one cur.executemany() on the entire set of records read from the file? I would potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit of the number of records that I can upload at any one time?
3. I am making a fresh connection to the database for every file that I open. I am doing this because the process takes many days to complete and I don't want connection issues to corrupt the entirety of the data if the connection is lost at any point. I have over a thousand files that I need to go through. Are those thousand connections going to be a significant part of the time used for the process?
4. Are there any other things that I am doing in the program that I shouldn't be doing, which could shorten the total time for the process?
Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason I don't seem to be able to find a definitive answer to any of these questions right now.
As you mentioned in point 3, you are worried about the database connection breaking, so if you use one conn.commit() only after all the inserts, you can easily lose already-inserted but not-yet-committed data if the connection breaks before conn.commit(). If you do conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you and depends on the workflow you need to support.
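A minimal sketch of that per-batch-commit pattern, with a rollback on failure (the variables mirror the question's code; the error handling is my own addition):

import psycopg2

# dbName, rows, header and insertString as defined in the question's code
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

batch_size = 1000
for start in range(0, len(rows), batch_size):
    values = [dict(zip(header, r)) for r in rows[start:start + batch_size]]
    try:
        cur.executemany(insertString, values)
        conn.commit()    # keep every finished batch, even if a later one fails
    except psycopg2.Error:
        conn.rollback()  # drop only the failed batch
        raise

cur.close()
conn.close()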
The number of records per batch is a trade-off between insertion speed and other things. You need to choose a value that satisfies your requirements; you can test your script with 1,000 records per batch, then with 10,000 per batch, and check the difference.
Inserting the whole file within one cur.executemany() has the advantage of atomicity: if it has been executed, all records from that particular file have been inserted, so we're back to point 1.
I think the cost of establishing a new connection in your case does not really matter. Let's say it takes one second to establish a new connection; with 1,000 files that is 1,000 seconds spent on connections over a process that takes days.
The program itself looks fine, but I would still recommend you take a look at the COPY command together with UNLOGGED or TEMPORARY tables; it will really speed up your imports.
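A rough illustration of the COPY route with psycopg2 (the table name, columns and DSN are placeholders; it assumes the bookkeeping step leaves you with plain row tuples):

import io
import psycopg2

conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost'")  # placeholder DSN
cur = conn.cursor()

# Build a tab-separated, in-memory "file" from the already-processed rows.
buf = io.StringIO()
for row in rows:  # rows as prepared by the bookkeeping step
    buf.write("\t".join(str(col) for col in row) + "\n")
buf.seek(0)

# Bulk-load the buffer in one shot; much faster than many INSERT statements.
cur.copy_from(buf, "my_table", columns=("a", "b", "c"))
conn.commit()

cur.close()
conn.close()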
While using cx_Oracle (Python), the code goes into a waiting state when the following statement is executed:
some_connection.execute(some_sql)
What could be the reason?
Without seeing the actual SQL in question it is hard to know for sure. Some possible answers include:
1) the SQL actually takes a long time to execute (and you just have to be patient)
2) the SQL is blocked by another transaction (and that transaction needs to be committed or rolled back first)
You can find out by examining the contents of dba_locks, specifically looking at the blocking_others column. You can also attempt to issue the same SQL in SQL*Plus and see if it exhibits the same behaviour.
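For example, a rough sketch of checking for blocking sessions from Python (the credentials and DSN are placeholders, and querying DBA_LOCKS requires the appropriate privileges):

import cx_Oracle

conn = cx_Oracle.connect("user", "password", "host:1521/service")  # placeholder connection
cur = conn.cursor()

# List sessions currently holding locks that block others.
cur.execute("""
    SELECT session_id, lock_type, mode_held, blocking_others
    FROM dba_locks
    WHERE blocking_others != 'Not Blocking'
""")
for row in cur.fetchall():
    print(row)

conn.close()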
I'm kind of new to Python and its MySQLdb connector.
I'm writing an API to return some data from a database using the RESTful approach. In PHP, I wrapped the Connection management part in a class, acting as an abstraction layer for MySQL queries.
In Python:
I define the connection early on in the script: con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
Then, in all subsequent methods:
import MySQLdb as mdb

def insert_func():
    with con:
        cur = con.cursor(mdb.cursors.DictCursor)
        cur.execute("INSERT INTO table (col1, col2, col3) VALUES (%s, %s, %s)", (val1, val2, val3))
        rows = cur.fetchall()
        # do something with the results
        return someval
etc.
I use mdb.cursors.DictCursor because I prefer to be able to access database columns in an associative array manner.
Now the problems start popping up:
In one function, I issue an insert query to create a 'group' with a unique 'groupid'.
This 'group' has a creator. Every user in the database holds a JSON array in the 'groups' column of his/her row in the table.
So when I create a new group, I want to assign the groupid to the user that created it.
I update the user's record using a similar function.
I've wrapped the 'insert' and 'update' parts in two separate function defs.
The first time I run the script, everything works fine.
The second time I run the script, the script runs endlessly (I suspect due to some idle connection to the MySQL database).
When I interrupt it using CTRL + C, I get one of the following errors:
"'Cursor' object has no attribute 'connection'"
"commands out of sync; you can't run this command now"
or any other KeyboardInterrupt exception, as would be expected.
It seems to me that these errors are caused by some erroneous way of handling connections and cursors in my code.
I read that it is good practice to use with con: so that the connection will automatically close itself after the query. I use 'with' on 'con' in each function, so the connection is closed, but I decided to define the connection globally so that any function can use it. This seems incompatible with the with con: context management. I suspect the cursor needs to be 'context managed' in a similar way, but I do not know how to do this. (To my knowledge, PHP doesn't use cursors for MySQL, so I have no experience with them.)
I now have the following questions:
Why does it work the first time but not the second? (It will, however, work again once after the CTRL + C interrupt.)
How should I go about using connections and cursors when using multiple functions (that can be called upon in sequence)?
I think there are two main issues going on here: one appears to be the Python code and the other is the structure of how you're interacting with your DB.
First, you're not closing your connection. This depends on your application's needs - you have to decide how long it should stay open. Reference this SO question
from contextlib import closing

with closing(connection.cursor()) as cursor:
    ...  # use the cursor
# cursor closed. Guaranteed.

connection.close()
Right now, you have to interrupt your program with Ctrl+C because there's no reason for your with statement to stop running.
Second, start thinking about your interactions with the DB in terms of transactions: do something and commit it to the DB; if it didn't work, roll back; if it did, close the connection. Here's a tutorial.
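A minimal sketch of that pattern with MySQLdb (the credentials, table and values are placeholders in the question's style):

import MySQLdb as mdb

con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
try:
    cur = con.cursor()
    cur.execute("INSERT INTO table (col1, col2, col3) VALUES (%s, %s, %s)",
                (val1, val2, val3))
    con.commit()    # the transaction worked, make it permanent
except mdb.Error:
    con.rollback()  # undo the partial work
    raise
finally:
    con.close()     # open late, close early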
With connections, as with file handles, the rule of thumb is: open late, close early.
So I would recommend sharing connections only where they are doing one thing. If you multiprocess, each process gets its own connection, again following open late, close early. And if you are doing sequential operations (say in a loop), open and close outside the loop. Having global connections can get messy, mainly because you now have to keep track of which function is using the connection at what time and what it is trying to do with it.
The "commands out of sync; you can't run this command now" issue is because your keyboard interrupt kills the active connection.
As to part one of your question: the endless run could be happening anywhere. Each instance of Python gets its own connection, so when you run the script the second time it should get its own connection. Open up a mysql client and run
show full processlist
to see what's going on.
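If it's more convenient to check from Python, a rough equivalent using the same MySQLdb connection style as the question (credentials are placeholders):

import MySQLdb as mdb

con = mdb.connect('localhost', 'user', 'passwd', 'dbname')
cur = con.cursor()
cur.execute("SHOW FULL PROCESSLIST")
for row in cur.fetchall():
    print(row)  # Id, User, Host, db, Command, Time, State, Info
con.close()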