I am inserting records into SQL Server from Python using pymssql. The database takes 2 milliseconds to execute a query, yet I only get 6 rows inserted per second, so the problem must be on the code side. How can I optimize the following code, or what is the fastest way to insert records?
def save(self):
    conn = pymssql.connect(host=dbHost, user=dbUser,
                           password=dbPassword, database=dbName, as_dict=True)
    cur = conn.cursor()
    self.pageURL = self.pageURL.replace("'","''")
    query = "my query is there"
    cur.execute(query)
    conn.commit()
    conn.close()
It looks like you're creating a new connection per insert there. That's probably the major reason for the slowdown: building new connections is typically quite slow. Create the connection outside the method and you should see a large improvement. You can also create a cursor outside the function and reuse it, which should give another speedup.
Depending on your situation, you may also want to use the same transaction for more than a single insertion. This changes the behaviour a little -- since a transaction is supposed to be atomic and either completely succeeds or completely fails -- but committing a transaction is typically a slow operation, because it has to be certain the whole operation succeeded.
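Here is a minimal sketch of both suggestions: one connection and cursor that are reused, and one commit per batch of inserts. The class name PageStore, the pages table and the commit_every parameter are made up for illustration. sqlite3 stands in for pymssql so the snippet runs without a server; pymssql would use %s where sqlite3 uses ?.

```python
import sqlite3


class PageStore:
    """Reuses one connection and cursor, and commits in batches."""

    def __init__(self, conn, commit_every=100):
        self.conn = conn
        self.cur = conn.cursor()          # created once, reused for every insert
        self.commit_every = commit_every  # hypothetical batch size
        self.pending = 0

    def save(self, page_url):
        # sqlite3 uses "?" placeholders; with pymssql this would be "%s".
        self.cur.execute("INSERT INTO pages (url) VALUES (?)", (page_url,))
        self.pending += 1
        if self.pending >= self.commit_every:
            self.flush()

    def flush(self):
        # One commit covers every insert since the previous commit.
        self.conn.commit()
        self.pending = 0


# Stand-in connection; with SQL Server this would be one pymssql.connect(...) call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT)")

store = PageStore(conn, commit_every=100)
for i in range(250):
    store.save(f"http://example.com/page/{i}")
store.flush()  # commit the final partial batch

row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(row_count)
```

The trade-off is the one described above: if something fails before flush() runs, every insert since the last commit is lost together.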
In addition to Thomas' great advice,
I'd suggest you look into executemany()*, e.g.:
cur.executemany("INSERT INTO persons VALUES(%d, %s)",
[ (1, 'John Doe'), (2, 'Jane Doe') ])
...where the second argument of executemany() should be a sequence of rows to insert.
This brings up another point:
You probably want to send your query and query parameters as separate arguments to either execute() or executemany(). This will allow the PyMSSQL module to handle any quoting issues for you.
*executemany() as described in the Python DB-API:
.executemany(operation, seq_of_parameters)
Prepare a database operation (query or command) and then execute it against all parameter sequences or mappings found in the sequence seq_of_parameters.
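As a rough illustration of why passing parameters separately helps, the sketch below inserts a URL containing an apostrophe without any manual replace("'", "''") step. The pages table is hypothetical, and sqlite3 stands in for pymssql so the snippet can run anywhere; pymssql would use %s where sqlite3 uses ?.

```python
import sqlite3

# sqlite3 is only a stand-in here; the calls are the DB-API ones pymssql also offers.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pages (id INTEGER, url TEXT)")

rows = [
    (1, "http://example.com/it's-a-page"),  # apostrophe, no manual escaping
    (2, "http://example.com/plain"),
]

# The SQL text and the values travel as separate arguments.
cur.executemany("INSERT INTO pages (id, url) VALUES (?, ?)", rows)
conn.commit()

stored = cur.execute("SELECT url FROM pages WHERE id = 1").fetchone()[0]
print(stored)
```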
def update_inv_quant():
    new_quant = int(input("Enter the updated quantity in stock: "))
Hello! I'm wondering how to insert a user variable into an SQL statement so that a record is updated to said variable. It'd also be really helpful if you could help me figure out how to print records from the database to the Python console. Thank you!
I tried doing something like ("INSERT INTO Inv(ItemName) Value {user_iname)") but I'm not surprised it didn't work.
It would have been more helpful if you specified an actual database.
First method (Bad)
The usual way (which is highly discouraged, as Graybeard said in the comments) is to use Python's f-strings. You can search for what they are and how to use them in more depth.
Basically, say you have two variables, user_id = 1 and user_name = 'fish'. An f-string turns f"INSERT INTO mytable(id, name) values({user_id},'{user_name}')" into the string INSERT INTO mytable(id, name) values(1,'fish').
As mentioned above, this exposes you to SQL injection. There are many good YouTube videos that demonstrate what that is and why it's dangerous.
Second method
The second method depends on which database driver you are using. For example, in psycopg2 (a driver for PostgreSQL), the cursor.execute method takes variables like this: cur.execute('SELECT id FROM users WHERE cookie_id = %s', (cookieid,)). Notice that the variables are passed in a tuple as the second argument.
All database drivers use similar methods, with minor differences. For example, sqlite3 uses ? where psycopg2 uses %s. That's why I said that specifying the actual database would have been more helpful.
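To make the difference between the two methods concrete, here is a small sketch using sqlite3 (chosen only because it ships with Python). The users table and the hostile input are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "fish"), (2, "cat")])

user_name = "nobody' OR '1'='1"  # hostile input typed by a user

# First method: the input becomes part of the SQL text and changes its meaning,
# so the WHERE clause ends up matching every row.
unsafe_sql = f"SELECT id FROM users WHERE name = '{user_name}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Second method: the input is only ever treated as a value, so nothing matches.
safe_rows = conn.execute(
    "SELECT id FROM users WHERE name = ?", (user_name,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```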
Fetching records
I am most familiar with PostgreSQL and psycopg2, so you will have to read the docs of your database of choice.
To fetch records, you send the query with cursor.execute() like we said before, and then call cursor.fetchone() which returns a single row, or cursor.fetchall() which returns all rows in an iterable that you can directly print.
Execute didn't update the database?
Statements executed through drivers are transactional, which is a whole topic by itself, and I am sure you will find people on the internet who can explain it better than I can. To keep things short: for the statement to actually change the database, you call connection.commit() after cursor.execute().
So finally to answer both of your questions, read the documentation of the database's driver and look for the execute method.
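Putting the pieces together, a sketch for the original question might look like the following. It assumes sqlite3 and an Inv table with a Quantity column; the question only shows ItemName, so the Quantity column and the sample rows are guesses made for illustration.

```python
import sqlite3


def update_inv_quant(conn, item_name, new_quant):
    """Set the stock quantity for one item and make the change permanent."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE Inv SET Quantity = ? WHERE ItemName = ?",
        (new_quant, item_name),
    )
    conn.commit()
    return cur.rowcount  # number of rows the UPDATE touched


def print_inventory(conn):
    """Print every row of the inventory table to the console."""
    rows = conn.execute(
        "SELECT ItemName, Quantity FROM Inv ORDER BY ItemName"
    ).fetchall()
    for item_name, quantity in rows:
        print(item_name, quantity)
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Inv (ItemName TEXT, Quantity INTEGER)")
conn.executemany("INSERT INTO Inv VALUES (?, ?)", [("bolt", 10), ("nut", 25)])
conn.commit()

# In the question this value would come from int(input(...)).
changed = update_inv_quant(conn, "bolt", 42)
inventory = print_inventory(conn)
```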
This is what I do (which is for sqlite3 and would be similar for other SQL type databases):
This assumes that you have connected to the database and that the table exists (otherwise you need to create the table first). For the purpose of the example, I have used a table called trades.
new_quant = 1000
# insert one record (row)
command = f"""INSERT INTO trades VALUES (
'some_ticker', {new_quant}, other_values, ...
) """
cur.execute(command)
con.commit()
print('trade inserted !!')
You can then wrap the above into your function accordingly.
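For comparison, the same insert can be wrapped in a function that passes the values through ? placeholders instead of an f-string. This is only a sketch: the two-column trades table and the function name insert_trade are assumptions, since the original column list is elided.

```python
import sqlite3


def insert_trade(con, ticker, quantity):
    """Insert one trade, passing the values separately from the SQL text."""
    # "with con" commits on success and rolls back if an exception is raised.
    with con:
        con.execute(
            "INSERT INTO trades (ticker, quantity) VALUES (?, ?)",
            (ticker, quantity),
        )


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (ticker TEXT, quantity INTEGER)")

new_quant = 1000
insert_trade(con, "some_ticker", new_quant)

trades = con.execute("SELECT ticker, quantity FROM trades").fetchall()
print(trades)
```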
I am executing a number of SELECT queries on a Postgres database using psycopg2, but I am getting ERROR: Out of shared memory. It suggests that I should increase max_locks_per_transaction, but this confuses me because each SELECT query operates on only one table, and max_locks_per_transaction is already set to 512, 8 times the default.
I am using TimescaleDB, which could be the cause of a larger than normal number of locks (one for each chunk rather than one for each table, maybe), but this still can't explain running out when so many are allowed. I'm assuming that what is happening here is that all the queries are being run as part of one transaction.
The code I am using looks something like this.
db = DatabaseConnector(**connection_params)
tables = db.get_table_list()
for table in tables:
    result = db.query(f"""SELECT a, b, COUNT(c) FROM {table} GROUP BY a, b""")
    print(result)
Where db.query is defined as:
def query(self, sql):
    with self._connection.cursor() as cur:
        cur.execute(sql)
        return_value = cur.fetchall()
    return return_value
and self._connection is:
self._connection = psycopg2.connect(**connection_params)
Do I need to explicitly end the transaction in some way to free up locks? And how can I go about doing this in psycopg2? I would have assumed that there was an implicit end to the transaction when the cursor is closed on __exit__. I know if I was inserting or deleting rows I would use COMMIT at the end, but it seems strange to use as I am not changing the table.
UPDATE: When I explicitly open and close the connection in the loop, the error does not show. However, I assume there is a better way to end the transaction after each SELECT than this.
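For what it's worth, psycopg2 opens a transaction on the first execute(), even for a SELECT, and keeps it open until commit() or rollback() is called on the connection. Closing the cursor does not end it. One possible shape for the query helper, sketched as a standalone function that takes the connection as an argument, is:

```python
def query(conn, sql, params=None):
    """Run a read-only query and end the transaction it implicitly opened.

    conn is assumed to be a psycopg2 connection with autocommit left off.
    """
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        # Closing the cursor does not end the transaction. rollback() does,
        # and the locks taken by the SELECT are released with it.
        conn.rollback()
```

Since nothing is modified, rollback() and commit() are equivalent here. Another option is to set conn.autocommit = True right after connecting, so that each statement runs in its own transaction.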
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with result sets that large.
But using py-postgresql and the .prepare() statement, I had hoped I could fetch entries on a "yield" basis and thus avoid filling up my memory with just the results from the database, which I apparently can't?
This is what I've got so far:
import postgresql
user = 'test'
passwd = 'test'
db = postgresql.open('pq://'+user+':'+passwd+'@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all results before looping through them?
Is there a way to get the library postgresql to "page" or batch down the results in say a 60k per round or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could open a named (server-side) cursor and loop over it, or call fetchone, to pull rows from the server progressively. Note that psycopg2's default cursor still copies the whole result set to the client; only named cursors are backed by a server-side portal.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically the way psycopg2's named cursors do.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing psycopg2 will take a few more changes. I still recommend trying it.
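As a rough sketch of batch-wise fetching with any DB-API cursor, the generator below pulls rows with fetchmany() and the loop collects the distinct days in a set. sqlite3 and the sample timestamps are stand-ins so the snippet runs without a server. With psycopg2 the cursor would need to be a named one, e.g. conn.cursor(name="days"), for the batching to actually limit client memory. The helper name iter_rows is made up.

```python
import sqlite3
from datetime import datetime, timezone


def iter_rows(cursor, batch_size=60_000):
    """Yield rows one at a time, pulling them from the driver in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


# Stand-in database with a handful of Unix timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (time INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)", [(0,), (3600,), (86400,), (90000,)]
)

cur = conn.cursor()
cur.execute("SELECT time FROM mytable")

unique_days = set()  # a set makes the "already seen?" check cheap
for (ts,) in iter_rows(cur, batch_size=2):
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    unique_days.add(day)

print(sorted(unique_days))
```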
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique_dates why not try something like this
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the query above to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j];
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates into Unix timestamps.
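One way the date-to-timestamp step could look is sketched below. It assumes the days are meant in UTC, whereas PostgreSQL's to_timestamp() and DATE() follow the session time zone, so adjust if yours differs. The helper name day_bounds is made up. sqlite3 stands in for PostgreSQL so the snippet runs without a server; a driver such as psycopg2 would use %s where sqlite3 uses ?.

```python
import sqlite3
from datetime import date, datetime, timezone


def day_bounds(day):
    """Return the first and last Unix timestamp of a calendar day, in UTC."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    start = int(midnight.timestamp())
    return start, start + 86400 - 1


# Stand-in table with a few timestamps around 1970-01-02.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (time INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)", [(10,), (86400,), (100000,), (172800,)]
)

# Pass the bounds as parameters rather than concatenating them into the SQL.
start_ts, end_ts = day_bounds(date(1970, 1, 2))
chunk = conn.execute(
    "SELECT time FROM mytable WHERE time BETWEEN ? AND ? ORDER BY time",
    (start_ts, end_ts),
).fetchall()
print(chunk)
```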
This isn't a question, so much as a pre-emptive answer. (I have gotten lots of help from this website & wanted to give back.)
I was struggling with a large bit of SQL query that was failing when I tried to run it via python using pymssql, but would run fine when directly through MS SQL. (E.g., in my case, I was using MS SQL Server Management Studio to run it outside of python.)
Then I finally discovered the problem: pymssql cannot handle temporary tables. At least not my version, which is still 1.0.1.
As proof, here is a snippet of my code, slightly altered to protect any IP issues:
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
database=sqlDB)
cur = conn.cursor()
cur.execute(testQuery)
The above code FAILS (returns no data, to be specific, and spits the error "pymssql.OperationalError: No data available." if you call cur.fetchone() ) if I call it with testQuery defined as below:
testQuery = """
CREATE TABLE #TEST (
[sample_id] varchar (256)
,[blah] varchar (256) )
INSERT INTO #TEST
SELECT DISTINCT
[sample_id]
,[blah]
FROM [myTableOI]
WHERE [Shipment Type] in ('test')
SELECT * FROM #TEST
"""
However, it works fine if testQuery is defined as below.
testQuery = """
SELECT DISTINCT
[sample_id]
,[blah]
FROM [myTableOI]
WHERE [Shipment Type] in ('test')
"""
I did a Google search as well as a search within Stack Overflow, and couldn't find any information regarding the particular issue. I also looked under the pymssql documentation and FAQ, found at http://code.google.com/p/pymssql/wiki/FAQ, and did not see anything mentioning that temporary tables are not allowed. So I thought I'd add this "question".
Update: July 2016
The previously-accepted answer is no longer valid. The second "will NOT work" example does indeed work with pymssql 2.1.1 under Python 2.7.11 (once conn.autocommit(1) is replaced with conn.autocommit(True) to avoid "TypeError: Cannot convert int to bool").
For those who run across this question and might have similar problems, I thought I'd pass on what I'd learned since the original post. It turns out that you CAN use temporary tables in pymssql, but you have to be very careful in how you handle commits.
I'll first explain by example. The following code WILL work:
testQuery = """
CREATE TABLE #TEST (
[name] varchar(256)
,[age] int )
INSERT INTO #TEST
values ('Mike', 12)
,('someone else', 904)
"""
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
database=sqlDB) ## obviously setting up proper variables here...
conn.autocommit(1)
cur = conn.cursor()
cur.execute(testQuery)
cur.execute("SELECT * FROM #TEST")
tmp = cur.fetchone()
tmp
This will then return the first item (a subsequent fetch will return the other):
('Mike', 12)
But the following will NOT work
testQuery = """
CREATE TABLE #TEST (
[name] varchar(256)
,[age] int )
INSERT INTO #TEST
values ('Mike', 12)
,('someone else', 904)
SELECT * FROM #TEST
"""
conn = pymssql.connect(host=sqlServer, user=sqlID, password=sqlPwd, \
database=sqlDB) ## obviously setting up proper variables here...
conn.autocommit(1)
cur = conn.cursor()
cur.execute(testQuery)
tmp = cur.fetchone()
tmp
This will fail saying "pymssql.OperationalError: No data available." The reason, as best I can tell, is that whether you have autocommit on or not, and whether you specifically make a commit yourself or not, all tables must explicitly be created AND COMMITTED before trying to read from them.
In the first case, you'll notice that there are two "cur.execute(...)" calls. The first one creates the temporary table. Upon finishing the "cur.execute()", since autocommit is turned on, the SQL script is committed, the temporary table is made. Then another cur.execute() is called to read from that table. In the second case, I attempt to create & read from the table "simultaneously" (at least in the mind of pymssql... it works fine in MS SQL Server Management Studio). Since the table has not previously been made & committed, I cannot query into it.
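The working pattern above amounts to sending each step in its own execute() call and fetching only after the last one. A small sketch of that pattern follows. The helper name run_batches is made up, and sqlite3 stands in for pymssql so it can run without a server, which is why the temporary table is created with CREATE TEMP TABLE instead of the T-SQL #TEST name.

```python
import sqlite3


def run_batches(cur, batches):
    """Execute each batch in its own execute() call; return rows from the last."""
    for sql in batches:
        cur.execute(sql)
    return cur.fetchall()


# Stand-in connection; with SQL Server this would come from pymssql.connect(...).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

rows = run_batches(cur, [
    "CREATE TEMP TABLE test (name TEXT, age INTEGER)",
    "INSERT INTO test VALUES ('Mike', 12), ('someone else', 904)",
    "SELECT * FROM test ORDER BY age",
])
print(rows)
```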
Wow... that was a hassle to discover, and it will be a hassle to adjust my code (developed on MS SQL Server Management Studio at first) so that it will work within a script. Oh well...
I was looking at the question and decided to try using the bind variables. I use
sql = 'insert into abc2 (intfield,textfield) values (%s,%s)'
a = time.time()
for i in range(10000):
    #just a wrapper around cursor.execute
    db.executeUpdateCommand(sql,(i,'test'))
db.commit()
and
sql = 'insert into abc2 (intfield,textfield) values (%(x)s,%(y)s)'
for i in range(10000):
    db.executeUpdateCommand(sql,{'x':i,'y':'test'})
db.commit()
Looking at the time taken for the two sets above, there doesn't seem to be much of a difference. In fact, the second one takes longer. Can someone correct me if I've made a mistake somewhere? I'm using psycopg2 here.
The queries are equivalent in PostgreSQL.
"Bind" is Oracle lingo. When you use it, the database saves the query plan so the next execution will be a little faster. PREPARE does the same thing in Postgres.
http://www.postgresql.org/docs/current/static/sql-prepare.html
psycopg2 does its own internal "bind" on the client in cursor.execute() and cursor.executemany(); it does not issue a server-side PREPARE.
(But don't call it bind to pg people. Call it prepare or they may not know what you mean :)
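If you want to try a real server-side prepared statement from psycopg2 anyway, you can issue PREPARE and EXECUTE yourself as plain SQL. The following is only a sketch: cur is assumed to be a psycopg2 cursor, the statement name ins_abc2 is made up, and the column types are guessed from the question.

```python
def insert_prepared(cur, rows):
    """Insert rows through a server-side prepared statement.

    cur is assumed to be a psycopg2 cursor on a PostgreSQL connection.
    """
    # The server parses the INSERT once, under a hypothetical statement name.
    cur.execute(
        "PREPARE ins_abc2 (int, text) AS "
        "INSERT INTO abc2 (intfield, textfield) VALUES ($1, $2)"
    )
    # Each EXECUTE reuses the prepared statement. psycopg2 still fills in
    # the %s placeholders on the client before sending the EXECUTE.
    for intfield, textfield in rows:
        cur.execute("EXECUTE ins_abc2 (%s, %s)", (intfield, textfield))
    # Prepared statements live for the session, so drop it when done.
    cur.execute("DEALLOCATE ins_abc2")
```

It could be called as insert_prepared(cur, ((i, 'test') for i in range(10000))), followed by a commit on the connection. Whether this is measurably faster than the plain loop depends on the query, so it is worth timing both.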
IMPORTANT UPDATE:
I've looked into the source of all the Python libraries for connecting to PostgreSQL in FreeBSD ports and can say that only py-postgresql does real prepared statements! But it is Python 3+ only.
Also, py-pg_queue is a fun library that implements the official DB protocol (Python 2.4+).
You've missed the answer to that question about prepared statements: use them as much as possible. Bound variables are a better form of this. Let's see:
sql_q = 'insert into abc (intfield, textfield) values (?, ?)' # common form
sql_b = 'insert into abc2 (intfield, textfield) values (:x , :y)' # should have driver and db support
so your test should be this:
sql = 'insert into abc2 (intfield, textfield) values (:x , :y)'
for i in range (10000):
    cur.execute(sql, x=i, y='test')
or this:
def _data(n):
    for i in range (n):
        yield (i, 'test')
sql = 'insert into abc2 (intfield, textfield) values (? , ?)'
cur.executemany(sql, _data(10000))
and so on.
UPDATE:
I've just found an interesting recipe for transparently replacing SQL queries with prepared statements while still using %(name)s placeholders.
As far as I know, psycopg2 has never supported server-side parameter binding ("bind variables" in Oracle parlance). Current versions of PostgreSQL do support it at the protocol level using prepared statements, but only a few connector libraries make use of it. The Postgres wiki notes this here. Here are some connectors that you might want to try: (I haven't used these myself.)
pg8000
python-pgsql
py-postgresql
As long as you're using DB-API calls, you probably ought to consider cursor.executemany() instead of repeatedly calling cursor.execute().
Also, binding parameters to their query in the server (instead of in the connector) is not always going to be faster in PostgreSQL. Note this FAQ entry.