I am trying to execute a DELETE statement that checks whether the table contains any SKU that also exists in the SKU column of my dataframe, and deletes the matching rows. Because I was using a for loop to iterate through the rows and check each one, the program took a long time to run for 6,000 rows of data.
I switched to executemany(), which was faster than a plain for loop for the DELETE statement, but I am struggling to find a faster alternative for checking the values in the dataframe.
sname = input("Enter name: ")
cursor = mydb.cursor(prepared=True)
column = df["SKU"]
data = [(sname, x) for x in column]
query = """DELETE FROM price_calculations1 WHERE Name=%s AND SKU=%s"""
cursor.executemany(query, data)
mydb.commit()
cursor.close()
Is there a more efficient way to achieve the same result?
You could first run SELECT id FROM price_calculations1 WHERE Name=%s AND SKU=%s
and then use a MySQL WHILE loop to delete those ids without the need for a cursor, which seems to be more performant.
See: https://www.mssqltips.com/sqlservertip/6148/sql-server-loop-through-table-rows-without-cursor/
A WHILE loop without the preceding SELECT might also work.
See: https://dev.mysql.com/doc/refman/8.0/en/while.html
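A client-side Python sketch of that select-then-delete idea (assuming the mydb connection, sname, and df from the question, and that the table has an id column; the chunk size of 1000 is arbitrary):
cursor = mydb.cursor()
ids = []
for sku in df["SKU"]:
    cursor.execute(
        "SELECT id FROM price_calculations1 WHERE Name=%s AND SKU=%s",
        (sname, sku),
    )
    ids.extend(row[0] for row in cursor.fetchall())

# Delete the collected ids in chunks so no single statement gets too large.
for start in range(0, len(ids), 1000):
    chunk = ids[start:start + 1000]
    placeholders = ",".join(["%s"] * len(chunk))
    cursor.execute(
        f"DELETE FROM price_calculations1 WHERE id IN ({placeholders})",
        chunk,
    )
mydb.commit()
cursor.close()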
Rather than looping, try to do all of the work in a single call to the database (a guideline that applies to most database work).
Given a list of name / sku pairs:
pairs = [(name1, sku1), (name2, sku2), ...]
create a query that identifies all the matching records and deletes them. (The inner derived table works around MySQL's restriction on selecting from the same table that is being modified.)
base_query = """DELETE t1 FROM price_calculations1 t1
WHERE t1.id IN (
    SELECT id FROM (
        SELECT t2.id FROM price_calculations1 t2
        WHERE {}
    ) AS matches
)
"""

# Build the WHERE clause criteria
criteria = " OR ".join(["(name = %s AND sku = %s)"] * len(pairs))

# Create the query
query = base_query.format(criteria)

# "Flatten" the value pairs
values = [i for j in pairs for i in j]

cursor.execute(query, values)
mydb.commit()
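To tie this back to the question, the pairs list could be built straight from the dataframe (a sketch, reusing sname and df from above):
pairs = [(sname, sku) for sku in df["SKU"]]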
I am writing large amounts of data to a SQLite database. I am using a temporary dataframe to find unique values. The SQL below takes forever in conn.execute(sql):
if upload_to_db:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                        AND t.instrumentSymbol = f.instrumentSymbol
                        AND t.observation = f.observation
                        AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows, roughly? About 15,000 at a time. Basically it pulls data into a pandas dataframe, makes some transformations, and then writes it to a SQLite database. There are probably 600 different instruments, each with around 15,000 rows, so roughly 9M rows ultimately. Give or take a million....
Depending on your SQL database, you could try using something like INSERT IGNORE (MySQL) or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the four columns you are checking.
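For SQLite itself, the analogous statement is INSERT OR IGNORE, which silently skips rows that would violate a unique constraint. A sketch, assuming a unique index already exists on the four columns (uq_hist is a hypothetical name):
with engine.begin() as cn:
    # Requires e.g.: CREATE UNIQUE INDEX uq_hist ON instrumentsHistory
    #   (datetime, instrumentSymbol, observation, observationColName);
    cn.execute("""INSERT OR IGNORE INTO instrumentsHistory
                      (datetime, instrumentSymbol, observation, observationColName)
                  SELECT datetime, instrumentSymbol, observation, observationColName
                  FROM tempTable""")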
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory
    (datetime, instrumentSymbol, observation, observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table, matching on four columns, until a match is found. In the worst case there is no match and a full table scan must be completed, so the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
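One way to verify that such an index is actually being used is SQLite's EXPLAIN QUERY PLAN (a sketch; the database filename and the probe values are made up):
import sqlite3

conn = sqlite3.connect("instruments.db")  # hypothetical filename
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT 1 FROM instrumentsHistory "
    "WHERE datetime = ? AND instrumentSymbol = ? "
    "AND observation = ? AND observationColName = ?",
    ("2020-01-01", "AAPL", 1.0, "close"),  # dummy probe values
).fetchall()
for step in plan:
    print(step)  # should mention the idx index once it exists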
Say I have a list of the following values:
listA = [1,2,3,4,5,6,7,8,9,10]
I want to put each value of this list into a column named formatteddate in my SQLite database using the executemany command, rather than looping through the entire list and inserting each value separately.
I know how to do it if I had multiple columns of data to insert. For instance, if I had to insert listA, listB, and listC, I could create a tuple like (listA[i], listB[i], listC[i]). Is it possible to insert one list of values without a loop? Also assume the inserted values are integers.
UPDATE:
Based on the answer provided I tried the following code:
def excutemanySQLCodewithTask(sqlcommand, task, databasefilename):
    # create a database connection
    conn = create_connection(databasefilename)
    with conn:
        cur = conn.cursor()
        cur.executemany(sqlcommand, [(i,) for i in task])
    return cur.lastrowid

tempStorage = [19750328, 19750330, 19750401, 19750402, 19750404, 19750406,
               19751024, 19751025, 19751028, 19751030]

excutemanySQLCodewithTask("""UPDATE myTable SET formatteddate = (?);""",
                          tempStorage, databasefilename)
It still takes too long (roughly 10 hours). I have 150,000 items in tempStorage. I tried INSERT INTO and that was slow as well. It seems like it isn't possible to make a list of tuples of integers.
As you say, you need a list of tuples. So you can do:
cursor.executemany("INSERT INTO my_table VALUES (?)", [(a,) for a in listA])
I'm trying to insert multiple rows into a table using a for loop in Python with the following code:
ID = 0
values = ['a', 'b', 'c']
for x in values:
    database.execute("INSERT INTO table (ID, value) VALUES (:ID, :value)",
                     ID=ID, value=x)
    ID += 1
I expected this code to insert three rows into my table. The problem is that it only executes the query once, so I only get the row (0, 'a').
There aren't any error messages popping up; it just doesn't update the table with the other two values. Weirdly enough, however, I can circumvent this problem by using multiple queries, like so:
ID = 0
values = ['a', 'b', 'c']
for x in values:
    database.execute("INSERT INTO table (ID) VALUES (:ID)", ID=ID)
    database.execute("INSERT INTO table (value) VALUES (:value)", value=x)
    ID += 1
While this works, the method becomes more tedious as I add columns to my table further down the line. Does anyone know why the first snippet of code doesn't work but the second one does?
The execute method takes a sequence (or, for named placeholders, a mapping) as its second parameter.
execute(sql[, parameters])
Executes an SQL statement. The SQL statement may be parameterized (i.e. placeholders instead of SQL literals). The sqlite3 module supports two kinds of placeholders: question marks (qmark style) and named placeholders (named style).
This should work:
database.execute("INSERT INTO table (ID, value) VALUES (:ID, :value)", {"ID": ID, "value": x})
You might want to investigate executemany while you're in the doc.
From the same doc:
commit()
This method commits the current transaction. If you don't call this method, anything you did since the last call to commit() is not visible from other database connections. If you wonder why you don't see the data you've written to the database, please check that you didn't forget to call this method.
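As a sketch of the executemany route (assuming database behaves like a standard sqlite3 connection; enumerate supplies the ID):
values = ['a', 'b', 'c']
# One (ID, value) tuple per row; executemany runs the statement once per tuple.
database.executemany("INSERT INTO table (ID, value) VALUES (?, ?)",
                     list(enumerate(values)))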
The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. I am then creating a dataframe of rows I would like to have removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary keys other than the built-in Oracle ROWID (which isn't a real table column; it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of DELETE statements is very efficient or Pythonic, but I am also concerned about building a single SQL DELETE statement keyed off ROWID that contains a clause with thousands of items:
WHERE ROWID IN ('eg1', 'eg2', ........, 'eg2345')
Is this concern valid? Any suggestions?
Using ROWID
Since you can use ROWID, that would be the ideal way to do it, and depending on the Oracle version, the query-length limit may be large enough for a query with that many elements in the IN clause. The real issue is the number of elements in the IN expression list, which is limited to 1000.
So you'll either have to break the list of ROWIDs into sets of 1000 at a time, or delete just a single row at a time, with or without executemany(); a chunked sketch follows the example below.
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
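A chunked version might look roughly like this (a sketch; assumes an open cx_Oracle cursor and connection, and the delrows list from above):
CHUNK = 1000  # Oracle's IN expression-list limit

for start in range(0, len(delrows), CHUNK):
    chunk = delrows[start:start + CHUNK]
    # Bind each ROWID instead of splicing it into the SQL text.
    placeholders = ", ".join(f":{i + 1}" for i in range(len(chunk)))
    cursor.execute(
        f"DELETE FROM sometable WHERE ROWID IN ({placeholders})",
        chunk,
    )
connection.commit()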
Without ROWID
Without a primary key or ROWID, the only way to identify each row is to specify all of the columns in the WHERE clause; to handle many rows at a time, the per-row conditions need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see it's not the nicest query to construct but allows you to do it without ROWIDs.
And in both cases, you probably don't need to use parameterised queries, since the IN list in the first approach and the OR grouping in the second are variable-length. (Yes, you could parameterise the fully constructed SQL with thousands of bind variables; I'm not sure what the limit on that is.) The executemany() way is definitely easier to write and use, but for speed, the single large queries above will probably outperform executemany with thousands of items.
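Building that OR'd statement from the row data could look roughly like this (a sketch; sometable and the column names are placeholders, and this version does use bind variables):
# rows: list of (col1, col2, col3) tuples taken from the dataframe
conditions = " OR ".join(
    "(col1 = :{} AND col2 = :{} AND col3 = :{})".format(i * 3 + 1, i * 3 + 2, i * 3 + 3)
    for i in range(len(rows))
)
sql = "DELETE FROM sometable WHERE " + conditions
# Flatten the row tuples into one bind-variable list.
binds = [value for row in rows for value in row]
cursor.execute(sql, binds)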
You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)
I have a SQLite 3 database which has the columns ID, usr_name, and time.
I take the last row and place it in a variable LAST_PERSON using Python:
your_rank = "SELECT usr_name,time FROM rank WHERE ID = (SELECT MAX(ID) FROM rank)"
I also have a var row which loops through each row in sorted order by time:
sql = "SELECT usr_name,time FROM rank ORDER BY time "
for row in cur.execute(sql):
I want to compare the your_rank row with each time-sorted row to get that last person's rank.
I tried
for row in cur.execute(sql):
    sql_list.append(row)
    if row is your_rank:
        this_is_your_rank = rank_number
    rank_number += 1
But I cannot get the if statement to work with SQLite 3, and I have not been able to find any way to compare these. Can anyone give me a clue?
If you cannot, thanks for taking the time to read this.
You want SELECT COUNT(ID) FROM rank WHERE time < your_time, or similar.
Looping over SQL results to find out what you want is clunky when you can just ask the database to give you the answer you want.
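In Python, that could look like this (a sketch, reusing the cur cursor from the question):
# Rank = number of users with a strictly smaller time, plus one.
your_time = cur.execute(
    "SELECT time FROM rank WHERE ID = (SELECT MAX(ID) FROM rank)"
).fetchone()[0]
your_rank = cur.execute(
    "SELECT COUNT(ID) + 1 FROM rank WHERE time < ?", (your_time,)
).fetchone()[0]
print(your_rank)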
Edit:
Your first query, which uses a subquery to find the user with the highest ID, can be:
SELECT MAX(ID),usr_name,time FROM rank
And you can combine them both together into "get the most recent user name, time, and their position" with:
SELECT MAX(ID), usr_name, time,
       (SELECT COUNT(ID) + 1 FROM rank WHERE time < r.time) AS Pos
FROM rank r
Edit again: OK. It's not clear what you mean by "I can't use the if statement", but here's some speculation:
If you actually typed row is your_rank, assuming you executed the your_rank SQL query and saved the result back into the same variable name, it fails because is is a Python keyword for testing whether two things are the same thing (that is, one thing with two names). It does not test whether two separate things have the same value; == is the equality test.
It also might fail because the result of an SQL query is effectively a list of tuples. Each row is a tuple and, depending on what you did to put the result into your_rank, they won't ever match when compared.
This might work, if you want to keep the same approach:
last_user = cursor.execute('select max(id),usr_name,time from rank').fetchone()
last_user_rank = 1
for row in cursor.execute('select id,usr_name,time from rank order by time asc'):
    if last_user[2] > row[2]:
        last_user_rank += 1
    else:
        break
print(last_user, last_user_rank)