Python code to cancel a running Oracle SQL Query - python

I have the following Python code, which runs multiple SQL queries against an Oracle database and combines the results into one dataframe.
The queries are stored in a txt file, one SQL query per row, and the loop runs them sequentially. I want to cancel any query that runs for more than 10 seconds so that it does not put extra load on the database.
The following code doesn't actually give me the results that I want. More specifically, this bit of the code doesn't really help with my issue:
if (time.time() - start) > 10:
    connection.cancel()
The full Python code follows. There is probably an Oracle function that can be called to cancel the query.
import pandas as pd
import cx_Oracle
import time
ip = 'XX.XX.XX.XX'
port = XXXX
svc = 'XXXXXX'
dsn_tns = cx_Oracle.makedsn(ip, port, service_name = svc)
connection = cx_Oracle.connect(user='XXXXXX'
                               , password='XXXXXX'
                               , dsn=dsn_tns
                               , encoding = "UTF-8"
                               , nencoding = "UTF-8"
                               )
filepath = 'C:/XXXXX'
appended_data = []
with open(filepath + 'sql_queries.txt') as fp:
    line = fp.readline()
    while line:
        start = time.time()
        df = pd.read_sql(line, con=connection)
        if (time.time() - start) > 10:
            connection.cancel()
            print("Cancel")
        appended_data.append(df)
        df_combined = pd.concat(appended_data, axis=0)
        line = fp.readline()
        print(time.time() - start)
fp.close()

A better approach would be to spend some time tuning the queries to make them as efficient as necessary. As @Andrew points out, we can't easily kill a database query from outside the database, or even from another session inside the database (that requires DBA-level privileges).
Indeed, most DBAs would rather you ran a query for 20 seconds than attempt to kill every query that runs for more than 10. Apart from anything else, having a process that polls your query to see how long it has been running is itself a waste of database resources.
I suggest you discuss this with your DBA. You may find you're worrying about nothing.

Look at cx_Oracle 7's Connection.callTimeout setting. You'll need to be using Oracle client libraries 18+. (These will connect to Oracle DB 11.2+). The doc for the equivalent node-oracledb parameter explains the fine print behind the Oracle behavior and round trips.
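As a rough sketch of how that setting might be applied to the loop in the question: the helper below sets the timeout once on the connection and then skips any statement the driver aborts. The name `run_with_call_timeout` and the `db_error` parameter are made up for illustration; with cx_Oracle you would pass `cx_Oracle.DatabaseError`. The attribute is spelled `callTimeout` in cx_Oracle 7 (later releases also accept `call_timeout`), and the value is in milliseconds.

```python
def run_with_call_timeout(connection, queries, timeout_ms=10_000, db_error=Exception):
    """Run each query in turn, skipping any that exceed the call timeout.

    `connection` is assumed to be a cx_Oracle 7+ connection (Oracle client
    libraries 18+). `db_error` should be the driver's error class, for
    example cx_Oracle.DatabaseError; it defaults to Exception only so this
    sketch has no hard dependency on the driver.
    """
    # Milliseconds allowed for a single round trip to the database.
    connection.callTimeout = timeout_ms

    results, skipped = [], []
    for sql in queries:
        cursor = connection.cursor()
        try:
            cursor.execute(sql)
            results.append(cursor.fetchall())
        except db_error as exc:
            # A timed-out call surfaces as a database error; record it and move on.
            skipped.append((sql, str(exc)))
        finally:
            cursor.close()
    return results, skipped
```

With pandas, the `cursor.execute`/`fetchall` pair could be swapped for `pd.read_sql(sql, con=connection)` inside the same `try` block.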


How to increase the efficiency of inserting data into PostGIS with Python?

I need to insert 46 million points into a PostGIS database in a decent time. Inserting 14 million points took around 40 minutes, which is awful and inefficient.
I created a database with a spatial GiST index and wrote this code:
import psycopg2
import time
start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2', user='postgres', password='alfabet1')
filepath = "C:\\Users\\nmt1m.csv"
curs = conn.cursor()
with open(filepath, 'r') as text:
    for i in text:
        i = i.replace("\n", "")
        i = i.split(sep=" ")
        curs.execute(f"INSERT INTO nmt_1 (geom, Z) VALUES (ST_GeomFromText('POINTZ({i[0]} {i[1]} {i[2]})',0), {i[2]});")
        conn.commit()
end = time.time()
print(end - start)
curs.close()
conn.close()
I'm looking for the best way to insert the data; it does not have to be in Python.
Thanks ;)
Cześć! Welcome to SO.
There are a few things you can do to speed up your bulk insert:
If the target table is empty or is not being used in a production system, consider dropping the indexes right before inserting the data. After the insert is complete you can recreate them. This prevents PostgreSQL from updating the index after every insert, which in your case means 46 million times.
If the target table can be entirely built from your CSV file, consider creating an UNLOGGED TABLE. Unlogged tables are much faster than "normal" tables, since they (as the name suggests) are not logged in the WAL file (write-ahead log). Unlogged tables might be lost in case of database crash or an unclean shutdown!
Use either the PostgreSQL COPY command or copy_from, as @MauriceMeyer pointed out. If for some reason you must stick to inserts, make sure you're not committing after every insert ;-)
Cheers
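The suggestions above can be combined into one load routine. The following is only a sketch: `bulk_load`, its arguments, and the way the index is passed in are hypothetical, and it assumes a psycopg2-style connection (`copy_from` is psycopg2's wrapper around COPY) and a target table whose columns match the file.

```python
def bulk_load(conn, file_obj, table, index_name, index_sql, sep=" "):
    """Drop the index, COPY the file in, rebuild the index, commit once.

    `conn` is assumed to be a psycopg2 connection. `table`, `index_name`
    and `index_sql` are trusted values supplied by the caller, not user input.
    """
    cur = conn.cursor()
    try:
        # 1. No index to maintain while the rows stream in.
        cur.execute(f"DROP INDEX IF EXISTS {index_name}")
        # 2. One COPY instead of millions of INSERT statements.
        cur.copy_from(file_obj, table, sep=sep)
        # 3. Build the index once, over the finished table.
        cur.execute(index_sql)
        # 4. A single commit for the whole load.
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

COPY loads column values as they appear in the file and cannot call a function such as ST_MakePoint on the way in, so for a file of plain x y z values the usual pattern is to copy into numeric columns first and build the geometry in a later UPDATE.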
Thanks, Jim, for the help. Following your instructions, a better way to insert the data is:
import psycopg2
import time
start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2',
user='postgres', password='alfabet1')
curs = conn.cursor()
filepath = "C:\\Users\\Jakub\\PycharmProjects\\test2\\testownik9_NMT\\nmt1m.csv"
curs.execute("CREATE UNLOGGED TABLE nmt_10 (id_1 FLOAT, id_2 FLOAT, id_3 FLOAT);")
with open(filepath, 'r') as text:
    curs.copy_from(text, 'nmt_10', sep=" ")
curs.execute("SELECT AddGeometryColumn('nmt_10', 'geom', 2180, 'POINTZ', 3);")
curs.execute("CREATE INDEX nmt_10_index ON nmt_10 USING GIST (geom);")
curs.execute("UPDATE nmt_10 SET geom = ST_SetSRID(ST_MakePoint(id_1, id_2, id_3), 2180);")
conn.commit()
end = time.time()
print(end - start)
cheers

mongodb 4x slower than sqlite, 2x slower than csv?

I am comparing performance of the two dbs, plus csv - data is 1 million row by 5 column float, bulk insert into sqlite/mongodb/csv, done in python.
import csv
import sqlite3
from time import time
import numpy as np
import pymongo
N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for i in range(N):
        writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
with pymongo.MongoClient() as client:
    start_w = time()
    client['warmup']['warmup'].insert_many(docs)
    start_w = time()
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
I am still new to this, but is it expected that mongodb could be 4x slower than sqlite, and 2x slower than csv, in similar scenarios? It is based on mongodb v4.4 with WiredTiger engine, and python3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, like the above example, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as @D. SM suggested. It helps, but overall it is still the slowest of the pack. What I meant is, total Wall time 23.9s, (warmup 14.2 + real insert 9.6). What's interesting is that CPU times total 18.1s, meaning 23.9-18.1 = 5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
In any case, even if I use warmup and disregard the IO wait time, the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls! Apparently the csv writer does much better job in buffering/caching. Did I get something seriously wrong here?
Another question somewhat related: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly, start from batch one, for each million insert, the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes: the driver splits the documents into batches no larger than the server's maxWriteBatchSize (1,000 operations before MongoDB 3.6, 100,000 since). If you are sending 1 million inserts, that is at least ten batches per call, and each one involves data copies. Test inserting 1000 documents at a time vs. other batch sizes.
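One way to run that comparison is to slice the documents yourself and time each batch size. A minimal sketch, assuming a PyMongo collection; the helper name `insert_in_batches` is made up here, and `ordered=False` is an optional choice (it lets the server continue past individual failures) rather than something this answer prescribes.

```python
from time import perf_counter

def insert_in_batches(collection, docs, batch_size=1000, ordered=False):
    """Insert docs in slices of batch_size; return (inserted_count, seconds)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    inserted = 0
    start = perf_counter()
    for offset in range(0, len(docs), batch_size):
        batch = docs[offset:offset + batch_size]
        # One insert_many call per slice, so the batch size is under our control.
        result = collection.insert_many(batch, ordered=ordered)
        inserted += len(result.inserted_ids)
    return inserted, perf_counter() - start
```

Calling it with several values, for example 1,000, 10,000 and 100,000, each time with fresh documents and an empty collection, shows how batch size affects wall time on a given setup.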
Working test:
import csv
import sqlite3
import pymongo, random, time
N, M = 1000000, 5
docs = [{'_id':1,'b':2,'c':3,'d':4,'e':5}]*N
i=1
for i in range(len(docs)):
    docs[i]=dict(docs[i])
    docs[i]['_id'] = i
data=[tuple(doc.values())for doc in docs]
with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    start = time.time()
    for i in range(N):
        writer.writerow(data[i])
    end = time.time()
    print('%f' %( end-start))
con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' %( end-start))
with pymongo.MongoClient() as client:
    client['warmup']['warmup'].delete_many({})
    client['test']['test'].delete_many({})
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time.time()
    coll.insert_many(docs)
    end = time.time()
    print('%f' %( end-start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
The client has to serialize the data to send to the server
The server has to deserialize that data
The server then has to parse the request and figure out what to do
The server needs to write the data in a scalable/concurrent way (sqlite simply errors with concurrent write errors from what I remember of it)
The server needs to compose a response back to the client, serialize that response, write it to the network
Client needs to read the response, deserialize, check it for success
"5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot."
Compared to what - an in-process database that does not do any network i/o?
"the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls"
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to file the writes are buffered by python's standard library before they are even sent to the kernel - you have to use flush() after each line to actually produce a million writes. In a database the writes are similarly performed on a page by page basis and not for individual documents.
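The buffering effect is easy to observe without touching a disk. The sketch below is an illustration only: `CountingRaw` and `count_raw_writes` are invented names, and the in-memory raw stream stands in for the file descriptor that the kernel would see.

```python
import csv
import io

class CountingRaw(io.RawIOBase):
    """Raw stream that discards data and counts the write() calls reaching it."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def writable(self):
        return True

    def write(self, b):
        self.write_calls += 1
        return len(b)  # report every byte as written

def count_raw_writes(n_rows, flush_each_row):
    """Write n_rows CSV rows through Python's buffering layers; return raw write count."""
    raw = CountingRaw()
    with io.TextIOWrapper(io.BufferedWriter(raw), encoding="utf-8", newline="") as text:
        writer = csv.writer(text)
        for i in range(n_rows):
            writer.writerow([i, i * 0.5])
            if flush_each_row:
                text.flush()  # push this row all the way down to the raw stream
    return raw.write_calls
```

Comparing `count_raw_writes(10000, False)` with `count_raw_writes(10000, True)` shows the difference: without the per-row flush only a small number of writes reach the raw stream, while with it there is roughly one per row.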

Accessing large datasets with Python 3.6, psycopg2 and pandas

I am trying to pull a 1.7G file into a pandas dataframe from a Greenplum postgres data source. The psycopg2 driver takes 8 or so minutes to load. Using the pandas "chunksize" parameter does not help as the psycopg2 driver selects all data into memory, then hands it off to pandas, using a lot more than 2G of RAM.
To get around this, I'm trying to use a named cursor, but all the examples I've found then loop through row by row, and that just seems slow. But the main problem appears to be that my SQL just stops working in the named query for some unknown reason.
Goals
load the data as quickly as possible without doing any "unnatural acts"
use SQLAlchemy if possible - used for consistency
have the results in a pandas dataframe for fast in-memory processing (alternatives?)
Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.
# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras
# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000
# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
# result returns None (normal behavior) but result is not iterable
df = pd.DataFrame(result.fetchall())
The pandas call returns AttributeError: 'NoneType' object has no attribute 'fetchall' Failure seems due to named cursor being used. Have tried fetchone, fetchmany, etc. Note the goal here is to let the server chunk and serve up the data in large chunks such that there is a balance of bandwidth and CPU usage. Looping through a df = df.append(row) is just plain fugly.
See related questions (not the same issue):
Streaming data from Postgres into Python
psycopg2 leaking memory after large query
Added standard client side chunking code per request
nrows = 3652504
size = nrows / 1000
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx,ignore_index=True)
UPDATE:
#Chunked access
start = time.time()
engine = create_engine(conn_str)
size = 10**4
df = pd.concat((x for x in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size)),
ignore_index=True)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
OLD answer:
I'd try to read data from PostgreSQL using internal Pandas method: read_sql():
from sqlalchemy import create_engine
engine = create_engine('postgresql://user@localhost:5432/dbname')
df = pd.read_sql(sql_query, engine)
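On the named-cursor error in the question: in the DB-API, `execute()` returns `None` and the rows are fetched from the cursor object itself, so `result.fetchall()` cannot work. A sketch of chunked fetching under that assumption follows; `iter_chunks` is a hypothetical helper, and it relies only on the standard `fetchmany()` method and `description` attribute, which psycopg2 named cursors also provide.

```python
def iter_chunks(cursor, sql, chunk_size=100_000):
    """Yield (column_names, rows) for each batch of up to chunk_size rows."""
    cursor.execute(sql)  # returns None; the rows come from the cursor itself
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        # Read description after fetching: a server-side cursor may only
        # populate it once the first batch has arrived.
        columns = [col[0] for col in cursor.description]
        yield columns, rows
```

With pandas, the batches could then be combined along the lines of `pd.concat((pd.DataFrame(rows, columns=cols) for cols, rows in iter_chunks(cursor_chunky, sql_query)), ignore_index=True)`, which keeps only one batch of raw rows in flight at a time.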

python script hangs when calling cursor.fetchall() with large data set

I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4126 rows)
On the small database, the following code works:
import os
import sys
import random
import mysql.connector
cnx = mysql.connector.connect(user='dbuser', password='thePassword',
host='127.0.0.1',
database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
host='127.0.0.1',
database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()
curs.execute(sql, params=None, multi=False)
result = curs.fetchall() #<=== script stops at this point
print len(result) #<=== this line never executes
print curs.column_names
curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector the solution to this is just to switch to a package that works better!

python+MySQLdb, simple select is too slow compared to flat file access

I have one simple table with 80 000 rows.
I'm trying to select and save all rows to python list as fast as possible.
It's taking around 4 - 10 seconds.
In contrast, if I dump the exact same table into a csv file and process it with this code
f = open('list.csv','rb')
lines = f.read().splitlines()
f.close()
print len(lines)
It's taking only 0.08 - 0.3 second
I tried MySQLdb and mysql.connector using fetchall() or fetchone()
import time
start = time.time()
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'login', 'p', 'db');
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM table")
    rows = cur.fetchall()
print len(rows)
print 'MySQLdb %s' % (time.time()-start)
Took 3.7 - 8 seconds with high CPU load
Is it possible to achieve same speed like with that csv file?
EDIT
My MySQL server seems to be ok.
In mysql console:
SELECT * from TABLE;
....
80789 rows in set (0.21 sec)
The entire query result is already stored on the client side as a list by the time cur.execute(...) returns. Check the self._rows attribute in MySQLdb/cursors.py for this.
That is to say, the time cost differs between reading file content and fetching query results from a MySQL database. As we all know, a built-in function is generally faster than a third-party one, so I don't think there is a way to make cursor.execute() as fast as open().
As for why open() is faster, I suggest you look into Python source code. Here is the link.
Hope it helps.
