I want to know the best approach for a string match with Python and a PostgreSQL database. My db contains pub names and zip codes. I want to check whether there are observations referring to the same pub but spelled differently by mistake.
Conceptually, I was thinking of looping through all the names and, for each other row in the same zip code, obtaining a string similarity metric using strsim. If this metric is above a threshold, I insert the pair into another SQL table which stores the match candidates.
I think I am being inefficient. In "pseudo-code", having pub_table, candidates_table and using the JaroWinkler function, I mean to do something like:
from similarity.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()
cur = conn.cursor()
cur.execute("SELECT name, zip FROM pub_table")
rows = cur.fetchall()
for r in rows:
    cur.execute("SELECT name FROM pub_table WHERE zip = %s", (r[1],))
    search = cur.fetchall()
    for pub in search:
        if jarowinkler.similarity(r[0], pub[0]) > threshold:
            insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                         "VALUES (%s, %s, %s)")
            cur.execute(insertion, (r[0], pub[0], r[1]))
cur.close()
conn.commit()
conn.close()
I am sorry if I am not being clear (novice here). Any guidance for string matching using PostgreSQL and Python will be highly appreciated. Thank you.
Both SELECT queries are on the same pub_table, and the inner loop with the second zip-matching query repeats for every row of pub_table. You could get the zip equality comparison directly in one query by doing an INNER JOIN of pub_table with itself.
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
pub_table p2
WHERE p1.zip = p2.zip
AND p1.name != p2.name -- this line assumes your original pub_table
-- has unique names for each "same pub in same zip"
-- and prevents the entries from matching with themselves.
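Note that p1.name != p2.name still returns each candidate pair twice, once as (A, B) and once as (B, A). If you only want each pair once, you can compare with < instead, for example:
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
     pub_table p2
WHERE p1.zip = p2.zip
  AND p1.name < p2.name  -- each pair of distinct names appears only once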
That would reduce your code to just the outer query & inner check + insert, without needing a second query:
cur.execute("<my query as above>")
rows = cur.fetchall()
for r in rows:
# r[0] and r[1] are the names. r[2] is the zip code
if jarowinkler.similarity(r[0], r[1]) > threshold:
insertion = ("INSERT INTO candidates_table (name1, name2, zip)
VALUES (%s, %s, %s)")
# since r already a tuple with the columns in the right order,
# you can replace the `(r[0], r[1], r[2])` below with just `r`
cur.execute(insertion, (r[0], r[1], r[2]))
# or ...
cur.execute(insertion, r)
Another change: The insertion string is always the same, so you can move that to before the for loop and only keep the parameterised cur.execute(insertion, r) inside the loop. Otherwise, you're just redefining the same static string over and over.
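Putting both changes together, a rough sketch (reusing the conn, threshold and jarowinkler objects from the question) might look like this:
query = """
    SELECT p1.name, p2.name, p1.zip
    FROM pub_table p1
    JOIN pub_table p2 ON p1.zip = p2.zip AND p1.name != p2.name
"""
insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
             "VALUES (%s, %s, %s)")

cur = conn.cursor()
cur.execute(query)
for r in cur.fetchall():
    # r is (name1, name2, zip), already in the column order of candidates_table
    if jarowinkler.similarity(r[0], r[1]) > threshold:
        cur.execute(insertion, r)

conn.commit()
cur.close()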
I have the following code and I'm running it on some big data (2 hours of processing time). I'm looking into CUDA for GPU acceleration, but in the meantime can anyone suggest ways to optimise the following code?
It takes a 3D point from dataset 'T' and finds the point in another point dataset 'B' with the minimum distance to it.
Is there any time saved by sending the results to a list first and then inserting them into the database table?
All suggestions welcome
conn = psycopg2.connect("<details>")
cur = conn.cursor()
for i in range(len(B)):
i2 = i + 1
# point=T[i]
point = B[i:i2]
# print(B[i])
# print(B[i:i2])
disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
print("Base: ", end='')
print(i, end='')
print(" of ", end='')
print(len(B), end='')
print(" ", end='')
print(disti)
cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
(xi[i], yi[i], zi[i], disti))
conn.commit()
cur.close()
############## EDIT #############
Code update:
conn = psycopg2.connect("dbname=kap_pointcloud host=localhost user=postgres password=Gnob2009")
cur = conn.cursor()
disti = []
for i in range(len(T)):
i2 = i + 1
point = T[i:i2]
disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
print("Top: " + str(i) + " of " + str(len(T)))
Insert code to go here once I figure out the syntax
######## EDIT ########
The solution, with a lot of help from Alex:
cur = conn.cursor()

from scipy.spatial.distance import cdist

# list for accumulating insert-params
insert_params = []
for i in range(len(T)):
    XA = [T[i]]
    disti = cdist(XA, B, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
    print("Top: " + str(i) + " of " + str(len(T)))

# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                insert_params)
conn.commit()
For timing comparison:
initial code took: 0:00:50.225644
without multiline prints: 0:00:47.934012
taking commit out of the loop: 0:00:25.411207
I'm assuming the only way to make it faster is to get CUDA working?
There are 2 solutions:
1) Try to do a single commit, or commit in chunks if len(B) is very large.
2) Prepare a list of the data you are inserting and do a bulk insert.
e.g.:
insert into pc_processing.pc_dist_base_tmp (x, y, z, dist) select * from unnest(array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4]);
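For reference, a minimal psycopg2 sketch of that unnest() approach, assuming the per-column values have been accumulated into plain Python lists xs, ys, zs and dists (psycopg2 adapts Python lists to PostgreSQL arrays):
# Sketch only: xs, ys, zs, dists are assumed to be lists of equal length.
cur.execute(
    "INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist) "
    "SELECT * FROM unnest(%s::float8[], %s::float8[], %s::float8[], %s::float8[])",
    (xs, ys, zs, dists)
)
conn.commit()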
OK. Let's accumulate all suggestions from comments.
Suggestion 1. Commit as rarely as possible and don't print at all
conn = psycopg2.connect("<details>")
cur = conn.cursor()
insert_params=[]
for i in range(len(B)):
i2 = i + 1
point = B[i:i2]
disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""", (xi[i], yi[i], zi[i], disti))
conn.commit() # Note that you commit only once. Be careful with **realy** big chunks of data
cur.close()
If you really need debug information inside your loops, use logging.
You will be able to turn logging output on and off as needed.
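For example, a minimal logging setup (just a sketch) could replace the prints like this:
import logging

logging.basicConfig(level=logging.INFO)  # set to logging.WARNING to silence the messages
log = logging.getLogger(__name__)

# inside the loop, instead of print(...):
log.info("Base: %s of %s %s", i, len(B), disti)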
Suggestion 2. executemany to the rescue
conn = psycopg2.connect("<details>")
cur = conn.cursor()
insert_params=[] # list for accumulating insert-params
for i in range(len(B)):
i2 = i + 1
point = B[i:i2]
disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
insert_params.append((xi[i], yi[i], zi[i], disti))
# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)", insert_params)
conn.commit()
cur.close()
Suggestion 3. Don't use INSERT statements at all. Use BULK operations
Instead of cur.execute and conn.commit, write the rows to a CSV file,
and then use COPY to load the created file.
The bulk solution should give the best performance, but it takes more effort to get working.
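For reference, a rough sketch of the COPY route using psycopg2's copy_expert, assuming the same insert_params list as in suggestion 2 (an in-memory buffer is used here instead of an actual CSV file):
import io

buf = io.StringIO()
for x, y, z, dist in insert_params:
    buf.write("%s\t%s\t%s\t%s\n" % (x, y, z, dist))
buf.seek(0)

# COPY ... FROM STDIN loads all tab-separated rows in one bulk operation
cur.copy_expert("COPY pc_processing.pc_dist_base_tmp (x, y, z, dist) FROM STDIN", buf)
conn.commit()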
Choose what is appropriate for you, depending on how much speed you need.
Good luck
Try committing when the loop has finished instead of on every single iteration.
I need to store a defaultdict object containing ~20M objects into a database. The dictionary maps a string to a string, so the table has two columns, no primary key because it's constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when number of values < ~1M.
Executing single statements. Works, but slow.
Using transactions
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)
cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")
i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s);", (k, str(self.table[k])))
    i += 1
    if i % 10000 == 0:
        print i
#cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s)", [(k, str(self.table[k])) for k in self.table])
cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()
print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.
MySQL can insert multiple values with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
Also, you could write your data to a file and use MySQL's LOAD DATA, which is designed to insert at "very high speed" (dixit MySQL).
I don't know whether "file writing" + "MySQL LOAD DATA" will be faster than inserting multiple values in one query (or several queries, if MySQL has a limit on it).
It depends on your hardware (writing a file is "fast" with an SSD), on your file system configuration, on your MySQL configuration, etc. So you have to test on your "prod" environment to see which solution is the fastest for you.
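For what it's worth, here is a rough sketch of the file + LOAD DATA route, assuming the MySQLdb/mysqlclient driver, that local_infile is enabled on the server, a hypothetical target table my_table(`key`, matches), and a plain dict named table holding the data:
import csv
import MySQLdb

# Hypothetical connection details; adjust to your environment.
con = MySQLdb.connect(host="localhost", user="user", passwd="pwd",
                      db="mydb", local_infile=1)
cur = con.cursor()

# Dump the dictionary to a tab-separated file first ...
with open("table_dump.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for k, v in table.items():
        writer.writerow([k, v])

# ... then let MySQL bulk-load the file in a single statement.
cur.execute("LOAD DATA LOCAL INFILE 'table_dump.tsv' "
            "INTO TABLE my_table FIELDS TERMINATED BY '\\t' (`key`, matches)")
con.commit()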
Instead of inserting directly, generate an SQL file (using extended inserts, etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll still save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)
=>
xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)
Here is some Python code that moves data from a database on one server to a database on another server:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")
cursor2.execute("START TRANSACTION")
cursor2.executemany( "INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
cursor1 )
cursor2.execute("COMMIT")
Now, for the sake of exposition, let's say that the transaction runs out of space on the target hard drive before the commit, and thus the commit is lost. But I'm using the transaction for performance reasons, not for atomicity. So, I would like to fill the hard drive with committed data so that it remains full and I can show it to my boss. Again, this is for the sake of exposition; the real question is below. In that scenario, I would rather do:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")
MAX_ELEMENTS_TO_MOVE_TOGETHER = 1000
dark_spawn = some_dark_magic_with_iterable( cursor1, MAX_ELEMENTS_TO_MOVE_TOGETHER )
for partial_iterable in dark_spawn:
    cursor2.execute("START TRANSACTION")
    cursor2.executemany( "INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
                         partial_iterable )
    cursor2.execute("COMMIT")
My question is, which is the right way of filling in some_dark_magic_with_iterable, that is, to create some sort of iterator with pauses in-between?
Just create a generator! :P
def some_dark_magic_with_iterable(curs, nelems):
    res = curs.fetchmany(nelems)
    while res:
        yield res
        res = curs.fetchmany(nelems)
Ok, ok... for generic iterators...
def some_dark_magic_with_iterable(iterable, nelems):
    it = iter(iterable)
    try:
        while True:
            res = []
            while len(res) < nelems:
                res.append(next(it))
            yield res
    except StopIteration:
        if res:
            yield res
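For comparison, the same chunking helper can also be written with itertools.islice from the standard library (just a sketch):
from itertools import islice

def some_dark_magic_with_iterable(iterable, nelems):
    it = iter(iterable)
    while True:
        # take up to nelems items at a time from the iterator
        chunk = list(islice(it, nelems))
        if not chunk:
            return
        yield chunk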
I've googled around a bit, but maybe I didn't put the correct magic incantation into the search box.
Does anyone know how to get output parameters from a stored procedure in Python? I'm using pymssql to call a stored procedure, and I'm not sure of the correct syntax to get the output parameter back. I don't think I can use any other db modules since I'm running this from a Linux box to connect to a mssql database on a MS Server.
import pymssql
con = pymssql.connect(host='xxxxx',user='xxxx',password='xxxxx',database='xxxxx')
cur = con.cursor()
query = "EXECUTE blah blah blah"
cur.execute(query)
con.commit()
con.close()
I'm not a Python expert, but after briefly perusing the DB-API 2.0 spec I believe you should use the "callproc" method of the cursor, like this:
cur.callproc('my_stored_proc', (first_param, second_param, an_out_param))
The value of the out param will then be available in the "an_out_param" slot of the sequence returned by callproc.
If you cannot (or don't want to) modify the original procedure and you have access to the database, you can write a simple wrapper procedure that is callable from Python.
For example, if you have a stored procedure like:
CREATE PROC GetNextNumber
    @NextNumber int OUTPUT
AS
...
You could write a wrapper like so, which is easily callable from Python:
CREATE PROC GetNextNumberWrap
AS
    DECLARE @RNextNumber int
    EXEC GetNextNumber @RNextNumber OUTPUT
    SELECT @RNextNumber
GO
Then you could call it from python like so:
import pymssql
con = pymssql.connect(...)
cur = con.cursor()
cur.execute("EXEC GetNextNumberWrap")
next_num = cur.fetchone()[0]
If you make your procedure produce a table, you can use that result as a substitute for out params.
So instead of:
CREATE PROCEDURE Foo (@Bar INT OUT, @Baz INT OUT) AS
BEGIN
    /* Stuff happens here */
    RETURN 0
END
do
CREATE PROCEDURE Foo (@Bar INT, @Baz INT) AS
BEGIN
    /* Stuff happens here */
    SELECT @Bar Bar, @Baz Baz
    RETURN 0
END
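Reading those values back from Python is then just a normal fetch. For instance with pymssql (a minimal sketch; the parameter values here are made up):
cur.execute("EXEC Foo @Bar = %s, @Baz = %s", (1, 2))
bar, baz = cur.fetchone()   # the single-row result set stands in for the out params
print(bar, baz)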
It looks like every python dbapi library implemented on top of freetds (pymssql, pyodbc, etc) will not be able to access output parameters when connecting to Microsoft SQL Server 7 SP3 and higher.
http://www.freetds.org/faq.html#ms.output.parameters
I was able to get an output value from a SQL Server stored procedure using Python. I could not find good help on getting output values in Python, so I figured out the syntax myself and suspect it is worth posting here:
import sys, string, os, shutil, arcgisscripting
from win32com.client import Dispatch
from adoconstants import *
#skip ahead to the important stuff
conn = Dispatch('ADODB.Connection')
conn.ConnectionString = "Provider=sqloledb.1; Data Source=NT38; Integrated Security = SSPI;database=UtilityTicket"
conn.Open()
#Target Procedure Example: EXEC TicketNumExists @ticketNum = 8386998, @exists OUTPUT
Cmd = Dispatch('ADODB.Command')
Cmd.ActiveConnection = conn
Cmd.CommandType = adCmdStoredProc
Cmd.CommandText = "TicketNumExists"
Param1 = Cmd.CreateParameter('@ticketNum', adInteger, adParamInput)
Param1.Value = str(TicketNumber)
Param2 = Cmd.CreateParameter('@exists', adInteger, adParamOutput)
Cmd.Parameters.Append(Param1)
Cmd.Parameters.Append(Param2)
Cmd.Execute()
Answer = Cmd.Parameters('@exists').Value
2016 update (callproc support in pymssql 2.x)
pymssql v2.x offers limited support for callproc. It supports OUTPUT parameters using the pymssql.output() parameter syntax. Note, however, that OUTPUT parameters can only be retrieved with callproc if the stored procedure does not also return a result set. That issue is discussed on GitHub here.
For stored procedures that do not return a result set
Given the T-SQL stored procedure
CREATE PROCEDURE [dbo].[myDoubler]
    @in int = 0,
    @out int OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT @out = @in * 2;
END
the Python code
import pymssql
conn = pymssql.connect(
host=r'localhost:49242',
database='myDb',
autocommit=True
)
crsr = conn.cursor()
sql = "dbo.myDoubler"
params = (3, pymssql.output(int, 0))
foo = crsr.callproc(sql, params)
print(foo)
conn.close()
produces the following output
(3, 6)
Notice that callproc returns the parameter tuple with the OUTPUT parameter value assigned by the stored procedure (foo[1] in this case).
For stored procedures that return a result set
If the stored procedure returns one or more result sets and also returns output parameters, we need to use an anonymous code block to retrieve the output parameter value(s):
Stored Procedure:
ALTER PROCEDURE [dbo].[myDoubler]
    @in int = 0,
    @out int OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT @out = @in * 2;
    -- now let's return a result set, too
    SELECT 'foo' AS thing UNION ALL SELECT 'bar' AS thing;
END
Python code:
sql = """\
DECLARE #out_value INT;
EXEC dbo.myDoubler #in = %s, #out = #out_value OUTPUT;
SELECT #out_value AS out_value;
"""
params = (3,)
crsr.execute(sql, params)
rows = crsr.fetchall()
while rows:
    print(rows)
    if crsr.nextset():
        rows = crsr.fetchall()
    else:
        rows = None
Result:
[('foo',), ('bar',)]
[(6,)]
You might also look at using SELECT rather than EXECUTE. EXECUTE is (IIRC) basically a SELECT that doesn't actually fetch anything; it just makes side effects happen.
You can try reformatting the query:
import pypyodbc

connstring = "DRIVER=SQL Server;"\
             "SERVER=servername;"\
             "PORT=1043;"\
             "DATABASE=dbname;"\
             "UID=user;"\
             "PWD=pwd"
conn = pypyodbc.connect(connstring)
cursor = conn.cursor()

query = "DECLARE @ivar INT \r\n" \
        "DECLARE @svar VARCHAR(MAX) \r\n" \
        "EXEC [procedure] " \
        "@par1=?, " \
        "@par2=?, " \
        "@param1=@ivar OUTPUT, " \
        "@param2=@svar OUTPUT \r\n" \
        "SELECT @ivar, @svar \r\n"

par1 = 0
par2 = 0
params = [par1, par2]
result = cursor.execute(query, params)
print(result.fetchall())
[1]https://amybughunter.wordpress.com/tag/pypyodbc/
Here's how I did it; the key is to declare the output parameter first:
import cx_Oracle as Oracle
conn = Oracle.connect('xxxxxxxx')
cur = conn.cursor()
idd = cur.var(Oracle.NUMBER)
cur.execute('begin :idd := seq_inv_turnover_id.nextval; end;', (idd,))
print(idd.getvalue())
I use pyodbc and then convert the pyodbc rows object to a list. Most of the answers show a query declaring variables as part of the query, but I would think you declare your variables as part of the stored procedure, thus eliminating an unnecessary step in Python. Then, in Python, all you have to do is pass the parameters to fill in those variables.
Here is the function I use to convert the pyodbc rows object to a usable list (of lists). Note that I have noticed pyodbc sometimes adds trailing spaces, so I account for that, which works well for me:
def convert_pyodbc(pyodbc_lst):
    '''Converts pyodbc rows into a usable list of lists (each SQL row is a list),
    then examines each list for elements that are strings,
    removes trailing spaces, and returns a usable list.'''
    usable_lst = []
    for row in pyodbc_lst:
        e = [elem for elem in row]
        usable_lst.append(e)
    for i in range(0, len(usable_lst[0])):
        for lst_elem in usable_lst:
            if isinstance(lst_elem[i], str):
                lst_elem[i] = lst_elem[i].rstrip()
    return usable_lst
Now if I need to run a stored procedure from python that returns a results set, I simply use:
strtdate = '2022-02-21'
stpdate = '2022-02-22'

conn = mssql_conn('MYDB')
cursor = conn.cursor()

qry = cursor.execute(f"EXEC mystoredprocedure_using_dates "
                     f"'{strtdate}','{stpdate}'")
results = convert_pyodbc(qry.fetchall())

cursor.close()
conn.close()
And sample results which I then take and write to a spreadsheet or w/e:
[[datetime.date(2022, 2, 21), '723521', 'A Team Line 1', 40, 9],
[datetime.date(2022, 2, 21), '723522', 'A Team Line 2', 15, 10],
[datetime.date(2022, 2, 21), '723523', 'A Team Line 3', 1, 5],
[datetime.date(2022, 2, 21), '723686', 'B Team Line 1', 39, 27],
[datetime.date(2022, 2, 21), '723687', 'B Team Line 2', 12, 14]]