Efficient string match with SQL and Python

I want to know the best approach for string matching with Python and a PostgreSQL database. My db contains pub names and zip codes. I want to check whether there are observations referring to the same pub but spelled differently by mistake.
Conceptually, I was thinking of looping through all the names and, for each other row with the same zip code, obtaining a string similarity metric using strsim. If this metric is above a threshold, I insert the pair into another SQL table which stores the match candidates.
I think I am being inefficient. In "pseudo-code", with pub_table and candidates_table and using the JaroWinkler function, I mean to do something like:
from similarity.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()

cur = conn.cursor()
cur.execute("SELECT name, zip FROM pub_table")
rows = cur.fetchall()

for r in rows:
    # compare against every other pub in the same zip code
    cur.execute("SELECT name FROM pub_table WHERE zip = %s", (r[1],))
    search = cur.fetchall()
    for pub in search:
        if jarowinkler.similarity(r[0], pub[0]) > threshold:
            insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                         "VALUES (%s, %s, %s)")
            cur.execute(insertion, (r[0], pub[0], r[1]))

cur.close()
conn.commit()
conn.close()
I am sorry if I am not being clear (novice here). Any guidance for string matching using PostgreSQL and Python would be highly appreciated. Thank you.

Both SELECT queries are on the same pub_table, and the inner loop with the second zip-match query repeats for every row of pub_table. You could do the zip equality comparison directly in one query with an INNER JOIN of pub_table with itself:
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
     pub_table p2
WHERE p1.zip = p2.zip
  AND p1.name != p2.name  -- this assumes your original pub_table has unique names
                          -- for each "same pub in same zip" and prevents entries
                          -- from matching with themselves
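As an aside, the != comparison will return each candidate pair twice, once in each order; if that matters, something along these lines (a sketch, same idea with < instead of !=) keeps only one of the two orderings:
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
     pub_table p2
WHERE p1.zip = p2.zip
  AND p1.name < p2.name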
That would reduce your code to just the outer query & inner check + insert, without needing a second query:
cur.execute("<my query as above>")
rows = cur.fetchall()
for r in rows:
# r[0] and r[1] are the names. r[2] is the zip code
if jarowinkler.similarity(r[0], r[1]) > threshold:
insertion = ("INSERT INTO candidates_table (name1, name2, zip)
VALUES (%s, %s, %s)")
# since r already a tuple with the columns in the right order,
# you can replace the `(r[0], r[1], r[2])` below with just `r`
cur.execute(insertion, (r[0], r[1], r[2]))
# or ...
cur.execute(insertion, r)
Another change: the INSERT string is always the same, so you can move it to before the for loop and keep only the parameterised cur.execute(insertion, r) inside the loop. Otherwise you're just redefining the same static string over and over.
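Putting it together, a rough sketch of the whole flow (assuming the conn, threshold and jarowinkler objects from your question) could look like this:
insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
             "VALUES (%s, %s, %s)")

cur = conn.cursor()
cur.execute("""
    SELECT p1.name, p2.name, p1.zip
    FROM pub_table p1, pub_table p2
    WHERE p1.zip = p2.zip
      AND p1.name != p2.name
""")

# fetch all candidate pairs once, then score and insert the close matches
for r in cur.fetchall():
    if jarowinkler.similarity(r[0], r[1]) > threshold:
        cur.execute(insertion, r)

cur.close()
conn.commit()
conn.close()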

Related

Python cx_Oracle: Oracle query with conditions including a list of values (IN)

I would like to get a list of values from a DB table in Python using cx_Oracle. I am unable to write a query with two WHERE conditions, one on a single value and another on a list.
I can achieve it when I filter by the two strings separately, or when I only filter by a list of strings, but I could not achieve both together!
output_list = []
catlist = ','.join(":x%d" % i for i, _ in enumerate(category_list))
db_cursor = connection.cursor()
db_cursor.execute("""
    SELECT LWEX_WORD_EXCLUDE
    FROM WCG_SRC_WORD_EXCLUDE
    WHERE LWEX_CATEGORY IN (%s) AND LWIN_USER_UPDATED = :did""" % catlist,
    category_list, did=argUser)
for word in db_cursor:
    output_list.append(word[0])
The current code throws an error, but if I use either of the conditions separately it works fine. The Python version I am using is 3.5.
You cannot mix and match "bind by position" and "bind by name", which is what you are doing in the above code. My suggestion would be to do something like this instead:
output_list = []
catlist = ','.join(":x%d" % i for i, _ in enumerate(category_list))
bindvals = category_list + [argUser]
db_cursor = connection.cursor()
db_cursor.execute("""
    SELECT LWEX_WORD_EXCLUDE
    FROM WCG_SRC_WORD_EXCLUDE
    WHERE LWEX_CATEGORY IN (%s) AND LWIN_USER_UPDATED = :did""" % catlist,
    bindvals)
for word in db_cursor:
    output_list.append(word[0])
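For example, with hypothetical values category_list = ['CAT1', 'CAT2'] and argUser = 'jdoe', everything ends up bound by position:
catlist    # -> ':x0,:x1'
bindvals   # -> ['CAT1', 'CAT2', 'jdoe']
# the statement sent to Oracle is then:
#   SELECT LWEX_WORD_EXCLUDE
#   FROM WCG_SRC_WORD_EXCLUDE
#   WHERE LWEX_CATEGORY IN (:x0,:x1) AND LWIN_USER_UPDATED = :did
# with the three values bound positionally, in placeholder order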

Insert list or tuple into table without iteration into postgresql

I am new to Python. What I am trying to achieve is to insert values from my list/tuple into my Redshift table without iteration. I have around 1 million rows and 1 column. Below is the code I am using to create my list/tuple:
cursor1.execute("select domain from url limit 5;")
for record, in cursor1:
ext = tldextract.extract(record)
mylist.append(ext.domain + '.' + ext.suffix)
mytuple = tuple(mylist)
I am not sure which is best to use, a tuple or a list. The output of print(mylist) and print(mytuple) is as follows:
List output:
['friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org']
Tuple output:
('friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org')
Now, below is the code I am using to insert the values into my Redshift table, but I am getting an error:
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mylist)
# or
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mytuple)
Error: not all arguments converted during string formatting
Any help is appreciated. If any other detail is required please let me know, I will edit my question.
UPDATE 1:
I tried the code below and am getting a different error:
args_str = ','.join(cur.mogrify("(%s)", x) for x in mylist)
cur.execute("INSERT INTO table VALUES " + args_str)
ERROR - INSERT has more expressions than target columns
I think you're looking for psycopg2's fast execution helpers (execute_values):
from psycopg2.extras import execute_values

mylist = [('t1',), ('t2',)]
execute_values(cursor2, "INSERT INTO sample(domain) VALUES %s", mylist, page_size=100)
What this does is replace the %s with up to page_size (here 100) rows of VALUES per statement. I'm not sure how high you can set page_size, but this should be far more performant.
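Note that mylist in the question holds plain strings, whereas execute_values expects a sequence of row tuples, so with the question's data it would look roughly like this:
from psycopg2.extras import execute_values

# wrap each domain string in a 1-tuple so every element becomes one row
rows = [(domain,) for domain in mylist]
execute_values(cursor2, "INSERT INTO sample(domain) VALUES %s", rows, page_size=100)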
Finally found a solution. For some reason cur.mogrify was not giving me a proper SQL string for the insert. I created my own SQL string and it works a lot faster than cur.executeall():
sql = ''
list_size = len(mylist)
for i in range(0, list_size):
    if i != list_size - 1:
        sql = sql + " ('" + mylist[i] + "'),"
    else:
        sql = sql + " ('" + mylist[i] + "')"
cursor1.execute("INSERT INTO sample(domain) VALUES " + sql)
Thanks for your help guys!
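As an aside, the mogrify attempt in the update probably failed because each element was passed as a bare string instead of a 1-tuple (and on Python 3 mogrify returns bytes). A sketch of that approach with those two details fixed:
# build one multi-row VALUES clause with properly escaped literals;
# mogrify returns bytes on Python 3, hence the .decode()
args_str = ','.join(cur.mogrify("(%s)", (domain,)).decode() for domain in mylist)
cur.execute("INSERT INTO sample(domain) VALUES " + args_str)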

sorting coding Issues

I have a text file with data like the example below:
South-America; Raul; Segio; 31; 34234556
Africa; Kofi; Adama; 27; 65432875
North-America; James; Watson; 29; 43552376
Africa; Koko; Stevens; 23; 23453243
Europe; Anthony; Baker; 32; 89878627
Well, the reason this is happening is that the CREATE line does not fall within the for loop. Because Europe is the last item in the file, it is the only one for which the statement is executed.
You want to move the execution inside the for loop, along the lines of:
mydb = MySQLdb.connect(host="127.0.0.1", user="root", passwd="12345678*", db="TESTDB1")
cursor = mydb.cursor()
with open('data.txt', 'r') as z:
    for line in z:
        m = {}
        # each line has five ;-separated fields
        (m['0'], m['1'], m['2'], m['3'], m['4']) = line.split(";")
        Table = m['0']
        sql = ("CREATE TABLE IF NOT EXISTS " + Table +
               " (name char(40), lastname char(40), age int(3), code int(10))")
        cursor.execute(sql)
mydb.close()
i.e. first open the connection, then loop over creating each of the tables you desire, then close it.
Another way would be to just indent the bottom half of your code.
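If you also want to load each row's values into its continent table (not shown above), a rough sketch of an extra step, assuming the column layout from the CREATE statement, would be:
# hypothetical continuation, placed inside the for loop right after cursor.execute(sql):
# values are bound with %s (MySQLdb's paramstyle); the table name itself cannot be bound
insert_sql = ("INSERT INTO " + Table +
              " (name, lastname, age, code) VALUES (%s, %s, %s, %s)")
cursor.execute(insert_sql, (m['1'].strip(), m['2'].strip(),
                            int(m['3']), int(m['4'])))
mydb.commit()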

Best way to insert ~20 million rows using Python/MySQL

I need to store a defaultdict object containing ~20M objects into a database. The dictionary maps a string to a string, so the table has two columns, no primary key because it's constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when number of values < ~1M.
Executing single statements. Works, but slow.
Using transactions
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)
cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")
i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) +
                " (key, matches) values (%s, %s);", (k, str(self.hashtable[k])))
    i += 1
    if i % 10000 == 0:
        print i
# cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) +
#                 " (key, matches) values (%s, %s)",
#                 [(k, str(self.table[k])) for k in self.table])
cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()
print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.
MySQL can insert multiple rows with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
Also, you could try writing your data to a file and using MySQL's LOAD DATA, which is designed to insert at "very high speed" (dixit MySQL).
I don't know whether "file writing" + "MySQL LOAD DATA" will be faster than inserting multiple values in one query (or several queries, if MySQL has a limit on statement size).
It depends on your hardware (writing a file is "fast" with an SSD), on your file-system configuration, on your MySQL configuration, etc. So you have to test on your "prod" environment to see which solution is fastest for you.
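A rough sketch of the file + LOAD DATA route (the file path and table name are placeholders, table stands for the dict from the question, the values are assumed to contain no tabs or newlines, and LOCAL INFILE has to be enabled on both client and server):
import codecs

# dump the dict as a tab-separated file
with codecs.open('/tmp/bulk_data.tsv', 'w', encoding='utf-8') as f:
    for k, v in table.items():
        f.write(u"%s\t%s\n" % (k, v))

cur = con.cursor()
# LOCAL makes the client send the file; plain LOAD DATA INFILE reads it server-side
cur.execute("LOAD DATA LOCAL INFILE '/tmp/bulk_data.tsv' "
            "INTO TABLE my_table "
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
            "(`key`, matches)")
con.commit()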
Instead of directly inserting, generate an SQL file (using extended inserts etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll still save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)
=>
xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)

Take a few elements from an iterable, do something, take a few elements more, and so on

Here is some Python code moving data from a database on one server to a database on another server:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")
cursor2.execute("START TRANSACTION")
cursor2.executemany( "INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
cursor1 )
cursor2.execute("COMMIT")
Now, for the sake of exposition, let's say that the transaction runs out of space on the target hard drive before the commit, and thus the commit is lost. But I'm using the transaction for performance reasons, not for atomicity. So I would like to fill the hard drive with committed data so that it remains full and I can show it to my boss. Again, this is for the sake of exposition; the real question is below. In that scenario, I would rather do:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")
MAX_ELEMENTS_TO_MOVE_TOGETHER = 1000
dark_spawn = some_dark_magic_with_iterable( cursor1, MAX_ELEMENTS_TO_MOVE_TOGETHER )
for partial_iterable in dark_spawn:
cursor2.execute("START TRANSACTION")
cursor2.executemany( "INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
partial_iterable )
cursor2.execute("COMMIT")
My question is: what is the right way to fill in some_dark_magic_with_iterable, that is, to create some sort of iterator that yields the elements in batches with pauses in between?
Just create a generator! :P
def some_dark_magic_with_iterable(curs, nelems):
    res = curs.fetchmany(nelems)
    while res:
        yield res
        res = curs.fetchmany(nelems)
Ok, ok... for generic iterators...
def some_dark_magic_with_iterable(iterable, nelems):
    try:
        while True:
            res = []
            while len(res) < nelems:
                res.append(next(iterable))  # iterable.next() in Python 2
            yield res
    except StopIteration:
        if res:
            yield res
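For example, batching a plain iterator with the generic version should behave like this:
batches = list(some_dark_magic_with_iterable(iter(range(7)), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]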
