I have the following code and I'm running it on some big data (2 hours of processing time). I'm looking into CUDA for GPU acceleration, but in the meantime can anyone suggest ways to optimise the code below?
It takes a 3D point from dataset 'T' and finds the point with the minimum distance to it in another point dataset 'B'.
Is any time saved by sending the results to a list first and then inserting them into the database table?
All suggestions welcome.
conn = psycopg2.connect("<details>")
cur = conn.cursor()

for i in range(len(B)):
    i2 = i + 1
    # point=T[i]
    point = B[i:i2]
    # print(B[i])
    # print(B[i:i2])
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    print("Base: ", end='')
    print(i, end='')
    print(" of ", end='')
    print(len(B), end='')
    print(" ", end='')
    print(disti)
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                (xi[i], yi[i], zi[i], disti))
    conn.commit()

cur.close()
############## EDIT #############
Code update:
conn = psycopg2.connect("dbname=kap_pointcloud host=localhost user=postgres password=Gnob2009")
cur = conn.cursor()
disti = []
for i in range(len(T)):
i2 = i + 1
point = T[i:i2]
disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
print("Top: " + str(i) + " of " + str(len(T)))
Insert code to go here once I figure out the syntax
######## EDIT ########
The solution, with a lot of help from Alex:
from scipy.spatial.distance import cdist

cur = conn.cursor()

# list for accumulating insert-params
insert_params = []
for i in range(len(T)):
    XA = [T[i]]
    disti = cdist(XA, B, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
    print("Top: " + str(i) + " of " + str(len(T)))

# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                insert_params)
conn.commit()
For timing comparison:
initial code took: 0:00:50.225644
without the multiline prints: 0:00:47.934012
taking the commit out of the loop: 0:00:25.411207
I'm assuming the only way to make it faster is to get CUDA working?
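(As a side note, and not something from the answers below: the distance computation itself can also be batched, since cdist accepts many query points per call, so the per-row Python loop can work on chunks. A rough sketch, reusing B, T, xi, yi, zi and the base table from above, with an arbitrary chunk size:)

from scipy.spatial.distance import cdist

chunk = 1000   # arbitrary; pick it so a (chunk x len(T)) distance matrix fits in RAM
insert_params = []
for start in range(0, len(B), chunk):
    block = B[start:start + chunk]
    # one cdist call gives the full distance matrix for the chunk;
    # min(axis=1) is the nearest-neighbour distance for each row in the block
    dists = cdist(block, T, metric='euclidean').min(axis=1)
    for offset, d in enumerate(dists):
        i = start + offset
        insert_params.append((xi[i], yi[i], zi[i], float(d)))

cur.executemany(
    "INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
    insert_params)
conn.commit()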
There are 2 solutions:
1) Commit only once, or commit in chunks if len(B) is very large.
2) Prepare a list of the data you are inserting and do a bulk insert,
e.g.:
insert into pc_processing.pc_dist_base_tmp (x, y, z, dist) select * from unnest(array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4]);
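For reference, a minimal sketch of driving that unnest form from psycopg2 (assumptions: the four columns are double precision, and xs, ys, zs, dists are hypothetical plain Python lists of floats collected in the loop; psycopg2 adapts Python lists to PostgreSQL arrays):

# xs, ys, zs, dists are lists built up in the loop above (hypothetical names)
cur.execute(
    "INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist) "
    "SELECT * FROM unnest(%s::float8[], %s::float8[], %s::float8[], %s::float8[])",
    (xs, ys, zs, dists))
conn.commit()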
OK. Let's accumulate all the suggestions from the comments.
Suggestion 1: commit as rarely as possible, and don't print at all.
conn = psycopg2.connect("<details>")
cur = conn.cursor()

for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                (xi[i], yi[i], zi[i], disti))

conn.commit()  # Note that you commit only once. Be careful with **really** big chunks of data
cur.close()
If you really need debug information inside your loops, use logging.
You will be able to turn logging on and off when you need it.
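A minimal sketch of what that could look like (the logger name is just a placeholder; i, B and disti come from the loop above):

import logging

logging.basicConfig(level=logging.INFO)   # raise to logging.WARNING to silence progress output
log = logging.getLogger("pc_processing")

# inside the loop, instead of the chain of print() calls:
log.info("Base: %d of %d dist=%s", i, len(B), disti)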
Suggestion 2: executemany to the rescue.
conn = psycopg2.connect("<details>")
cur = conn.cursor()

insert_params = []  # list for accumulating insert-params
for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))

# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)", insert_params)
conn.commit()
cur.close()
Suggestion 3: don't row-insert through psycopg2 at all; use a BULK operation.
Instead of cur.execute and conn.commit, write a CSV file
and then load it with COPY from the created file.
The BULK solution should give the best performance, but it needs some effort to make it work (see the sketch below).
Choose for yourself what is appropriate, depending on how much speed you need.
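For reference, a minimal sketch of that route (this still uses psycopg2, but only for the server-side COPY; the file path is arbitrary and insert_params is the same list of (x, y, z, dist) tuples as in Suggestion 2):

import csv

# write the accumulated rows to a temporary CSV file ...
with open('/tmp/pc_dist_base.csv', 'w', newline='') as f:
    csv.writer(f).writerows(insert_params)

# ... then load the whole file in one COPY
with open('/tmp/pc_dist_base.csv', 'r') as f:
    cur.copy_expert(
        "COPY pc_processing.pc_dist_base_tmp (x, y, z, dist) FROM STDIN WITH (FORMAT csv)",
        f)
conn.commit()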
Good luck
Try committing when the loop is finished instead of on every single iteration.
Related
I have this code to generate unique codes:
import mysql.connector as mariadb
import time
import random

mariadb_connection = mariadb.connect(user='XX', password='XX', database='XX', port='3306', host='192.168.XX.XX')
cursor = mariadb_connection.cursor()

FullChar = 'CDHKPMQRVXY123456789'
total = 20
count = 7
entries = []
uq_id = 0
flg_id = None
get_id = None
bcd = None

def inputDatabase(data):
    try:
        maria_insert_query = "INSERT INTO BB_UNIQUE_CODE(unique_code, flag_code, get_id, barcode) VALUES (%s, %s, %s, %s)"
        cursor.executemany(maria_insert_query, data)
        mariadb_connection.commit()
        print("Commiting " + str(total) + " entries..")
    except Exception:
        maria_alter_query = "ALTER TABLE PromoBigrolls.BB_UNIQUE_CODE AUTO_INCREMENT=0"
        mariadb_connection.rollback()
        print("UniqueCode Rollbacked")
        cursor.execute(maria_alter_query)
        print("UniqueCode Increment Altered")

while (0 < 1):
    for i in range(total):
        unique_code = ''.join(random.sample(FullChar, count))
        entry = (unique_code, flg_id, get_id, bcd)
        entries.append(entry)
    inputDatabase(entries)
    # print(entries)
    entries.clear()
    time.sleep(0.1)
My code output:
1 K4C1D9M null null null
2 K2R9XH3 null null null
3 5M3V9R2 null null null
This code runs correctly, but after the number of generated unique codes reaches 30 M there are too many rollbacks, because whenever a newly generated code already exists in the database the newest batch gets rolled back. Any suggestions?
Thanks
In my opinion a UUID is definitely the way to go. By the way, your usage of random.sample() is rather unusual: you may end up generating the same "unique" code twice (or more) in a row.
I don't know why you would want to customize a UUID, since it is intended to be a meaningless unique identifier; but if you really need your ID to be a string made of the chars in FullChar, then you can generate the UUID, convert it to a list of indexes, and use that to build your final string:
import uuid

def int2code(n, codestring):
    """Convert a non-negative integer to a string, using the chars in codestring as digits."""
    base = len(codestring)
    numbers = []
    while n > 0:
        x = n % base
        numbers.append(x)
        n //= base
    chars = [codestring[c] for c in reversed(numbers)]
    return ''.join(chars)

unique_code = int2code(uuid.uuid1().int, FullChar)
EDIT
As noted by @shoaib30, you were generating 7-character codes.
While this is not difficult to handle (the easiest, although probably not the smartest, way would be to just calculate uuid.uuid1().int % 20**7 and feed that to the function above), it can easily lead to collisions: the UUID is a 128-bit integer, or about 3.4e+38 possible values, while the permutations of 7 out of 20 items number just about 3.9e+08, or 390 million. So roughly 1.0e+30 different UUIDs will be translated to the same code.
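A minimal sketch of that modulo variant, using the int2code function above (this is only an illustration: folding the UUID down to 20**7 values makes collisions likely, so the database would still need a uniqueness check):

import uuid

FullChar = 'CDHKPMQRVXY123456789'
count = 7

# fold the 128-bit UUID into the 7-character code space (20**7 values);
# different UUIDs can map to the same code, so this does NOT guarantee uniqueness
code_int = uuid.uuid1().int % (len(FullChar) ** count)
unique_code = int2code(code_int, FullChar).rjust(count, FullChar[0])  # left-pad with the "zero" digit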
I want to know the best approach for string matching with Python and a PostgreSQL database. My db contains pub names and zip codes. I want to check whether there are observations referring to the same pub but spelled differently by mistake.
Conceptually, I was thinking of looping through all the names and, for each other row with the same zip code, obtaining a string similarity metric using strsim. If this metric is above a threshold, I insert the pair into another SQL table which stores the match candidates.
I think I am being inefficient. In "pseudo-code", with pub_table, candidates_table and the JaroWinkler function, I mean to do something like:
from similarity.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()

cur = conn.cursor()
cur.execute("SELECT name, zip FROM pub_table")
rows = cur.fetchall()

for r in rows:
    cur.execute("SELECT name FROM pub_table WHERE zip = %s", (r[1],))
    search = cur.fetchall()
    for pub in search:
        if jarowinkler.similarity(r[0], pub[0]) > threshold:
            insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                         "VALUES (%s, %s, %s)")
            cur.execute(insertion, (r[0], pub[0], r[1]))

cur.close()
conn.commit()
conn.close()
I am sorry if I am not being clear (novice here). Any guidance on string matching using PostgreSQL and Python would be highly appreciated. Thank you.
Both SELECT queries are on the same pub_table table, and the inner loop with the second zip-matching query repeats for every row of pub_table. You could get the zip equality comparison directly in one query by doing an INNER JOIN of pub_table with itself.
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
     pub_table p2
WHERE p1.zip = p2.zip
  AND p1.name != p2.name  -- this line assumes your original pub_table
                          -- has unique names for each "same pub in same zip"
                          -- and prevents the entries from matching with themselves
That would reduce your code to just the outer query & inner check + insert, without needing a second query:
cur.execute("<my query as above>")
rows = cur.fetchall()
for r in rows:
# r[0] and r[1] are the names. r[2] is the zip code
if jarowinkler.similarity(r[0], r[1]) > threshold:
insertion = ("INSERT INTO candidates_table (name1, name2, zip)
VALUES (%s, %s, %s)")
# since r already a tuple with the columns in the right order,
# you can replace the `(r[0], r[1], r[2])` below with just `r`
cur.execute(insertion, (r[0], r[1], r[2]))
# or ...
cur.execute(insertion, r)
Another change: The insertion string is always the same, so you can move that to before the for loop and only keep the parameterised cur.execute(insertion, r) inside the loop. Otherwise, you're just redefining the same static string over and over.
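Putting both changes together, the loop could look something like this (a sketch, with the same placeholder query and names as above):

insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
             "VALUES (%s, %s, %s)")

cur.execute("<my query as above>")
for r in cur.fetchall():
    # r = (name1, name2, zip)
    if jarowinkler.similarity(r[0], r[1]) > threshold:
        cur.execute(insertion, r)

conn.commit()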
I am new to Python. What I am trying to achieve is to insert values from my list/tuple into my Redshift table without iteration. I have around 1 million rows and 1 column. Below is the code I am using to create my list/tuple.
cursor1.execute("select domain from url limit 5;")
for record, in cursor1:
ext = tldextract.extract(record)
mylist.append(ext.domain + '.' + ext.suffix)
mytuple = tuple(mylist)
I am not sure which is best to use, a tuple or a list. The output of print(mylist) and print(mytuple) is as follows:
List output:
['friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org']
Tuple output:
('friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org')
Now, below is the code I am using to insert the values into my Redshift table, but I am getting an error:
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mylist)
# or
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mytuple)
Error - not all arguments converted during string formatting
Any help is appreciated. If any other detail is required please let me know, I will edit my question.
UPDATE 1:
I tried the code below and am getting a different error.
args_str = ','.join(cur.mogrify("(%s)", x) for x in mylist)
cur.execute("INSERT INTO table VALUES " + args_str)
ERROR - INSERT has more expressions than target columns
I think you're looking for psycopg2's Fast Execution Helpers:
from psycopg2.extras import execute_values

mylist = [('t1',), ('t2',)]
execute_values(cursor2, "INSERT INTO sample(domain) VALUES %s", mylist, page_size=100)
What this does is replace the %s with up to page_size (here 100) rows per VALUES list. I'm not sure how high you can set page_size, but this should be far more performant than per-row execute calls.
Finally found a solution. For some reason cur.mogrify was not giving me a proper SQL string for the insert. I created my own SQL string and it works a lot faster than cur.executemany():
sql = ''
list_size = len(mylist)
for idx in range(list_size):
    if idx != list_size - 1:
        sql = sql + "('" + mylist[idx] + "'),"
    else:
        sql = sql + "('" + mylist[idx] + "')"

cursor1.execute("INSERT INTO sample(domain) VALUES " + sql)
Thanks for your help guys!
I have a text file with data like the one below:
South-America; Raul; Segio; 31; 34234556
Africa; Kofi; Adama; 27; 65432875
North-America; James; Watson; 29; 43552376
Africa; Koko; Stevens; 23; 23453243
Europe; Anthony; Baker; 32; 89878627
Well, the reason this is happening is that the create line does not fall within the for loop. Because Europe is on the last line of the file, it is the only one for which the create is executed.
You want to move the execution inside the for loop, along the lines of:
import MySQLdb

mydb = MySQLdb.connect(host="127.0.0.1", user="root", passwd="12345678*", db="TESTDB1")
cursor = mydb.cursor()

with open('data.txt', 'r') as z:
    for line in z:
        m = {}
        # each line has 5 fields: region; name; lastname; age; code
        (m['0'], m['1'], m['2'], m['3'], m['4']) = line.strip().split(";")
        Table = m['0']
        # backticks allow table names containing a hyphen, e.g. South-America
        sql = "CREATE TABLE IF NOT EXISTS `" + Table + "` (name char(40), lastname char(40), age int(3), code int(10))"
        cursor.execute(sql)

mydb.close()
i.e. first connect to the database, then loop over the lines, creating each of the tables you desire, then close the connection.
Another way would be to just indent the bottom half of your code.
I need to store a defaultdict object containing ~20M objects into a database. The dictionary maps a string to a string, so the table has two columns, no primary key because it's constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when the number of values is < ~1M.
Executing single statements. Works, but slow.
Using transactions:
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)

cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")

i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s);",
                (k, str(self.hashtable[k])))
    i += 1
    if i % 10000 == 0:
        print i
# cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s)",
#                 [(k, str(self.table[k])) for k in self.table])

cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()

print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.
MySQL can insert multiple rows with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
Also, you could write your data to a file and use LOAD DATA from MySQL, which is designed to insert with "very high speed" (dixit MySQL).
I don't know whether "file writing" + "MySQL LOAD DATA" will be faster than inserting multiple values in one query (or many queries if MySQL has a limit on it).
It depends on your hardware (writing a file is "fast" with an SSD), on your file system configuration, on your MySQL configuration, etc. So you have to test on your "prod" env to see which solution is the fastest for you.
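For reference, a rough sketch of the file-plus-LOAD DATA route, reusing the names from the question's snippet (it assumes keys and values contain no tabs or newlines, and that LOCAL INFILE is enabled on both client and server):

# 1. dump the dict to a tab-separated file
with open('/tmp/table_dump.tsv', 'w') as f:
    for k in self.table:
        f.write(k + '\t' + str(self.table[k]) + '\n')

# 2. load the whole file in a single statement
cur = con.cursor()
cur.execute(
    "LOAD DATA LOCAL INFILE '/tmp/table_dump.tsv' "
    "INTO TABLE " + sqlutils.gettablename(self.sequence) +
    " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' (`key`, matches)")
con.commit()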
Instead of inserting directly, generate an SQL file (using extended inserts, etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll still save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)
=>
xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)