psycopg2 - fastest way to insert rows to multiple tables? - python

I am currently using this code:
while True:
    col_num = 0
    for table in table_names:
        cursor.execute("INSERT INTO public.{0} VALUES(CURRENT_TIMESTAMP, 999999)".format(table))
        cursor.connection.commit()
        col_num += 1
    row_num += 1
And this is pretty slow. One of the problems I see is that it commits once per table. If I could commit the inserts for all tables at once, I think that would improve performance. How should I go about this?

You can commit outside the loop:
for table in table_names:
    cursor.execute("INSERT INTO public.{0} VALUES(CURRENT_TIMESTAMP, 999999)".format(table))
cursor.connection.commit()
However, there is a side effect: the first column (the timestamp) will differ between rows when each insert is committed separately, but will be identical across rows when they are committed together. This is because CURRENT_TIMESTAMP returns the time at the start of the current transaction.
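As a side note, if the table names come from elsewhere in the program, psycopg2's sql module can build the statement with a properly quoted identifier instead of str.format. A minimal sketch of the same loop, assuming the cursor and table_names from the question:

from psycopg2 import sql

# One INSERT per table, but only a single commit at the end;
# sql.Identifier() quotes the table name safely.
for table in table_names:
    cursor.execute(
        sql.SQL("INSERT INTO public.{} VALUES (CURRENT_TIMESTAMP, 999999)")
        .format(sql.Identifier(table))
    )
cursor.connection.commit()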

Related

Why is this sql statement super slow?

I am writing large amounts of data to a SQLite database. I am using a temporary dataframe to find unique values.
This SQL code takes forever in conn.execute(sql):
if upload_to_db == True:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                        AND t.instrumentSymbol = f.instrumentSymbol
                        AND t.observation = f.observation
                        AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows, roughly? About 15,000 at a time. Basically it pulls data into a pandas dataframe, makes some transformations, and then writes it to a SQLite database. There are probably 600 different instruments, each with around 15,000 rows, so roughly 9M rows ultimately. Give or take a million....
Depending on your SQL database, you could try using something like INSERT IGNORE (MySQL) or MERGE (e.g. in Oracle), which performs the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the four columns you are checking.
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol, observation, observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the database can quickly determine whether a match exists.
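Since the question uses SQLite through SQLAlchemy, one way to combine both suggestions is to let a unique index do the duplicate check and use INSERT OR IGNORE instead of the NOT EXISTS subquery. A rough sketch, assuming the table and column names from the question and that instrumentsHistory contains no duplicates yet (otherwise the unique index cannot be created):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///instruments.db")  # hypothetical database file

with engine.begin() as cn:
    # One-time setup: the unique index both speeds up the duplicate check
    # and lets SQLite enforce it.
    cn.execute(text("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_hist
        ON instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
    """))
    # INSERT OR IGNORE silently skips rows that would violate the index,
    # so the NOT EXISTS subquery is no longer needed.
    cn.execute(text("""
        INSERT OR IGNORE INTO instrumentsHistory
            (datetime, instrumentSymbol, observation, observationColName)
        SELECT datetime, instrumentSymbol, observation, observationColName
        FROM tempTable
    """))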

Iteratively INSERTing from a Dataframe

My question may be out of pure ignorance. Given an arbitrary dataframe of, say, 5 rows, I want to insert that dataframe into a DB (in my case PostgreSQL). The general code to do that is along the lines of:
postgres_insert_query = """ INSERT INTO table (ID, MODEL, PRICE) VALUES (%s,%s,%s)"""
record_to_insert = (1, 'A', 100)
cursor.execute(postgres_insert_query, record_to_insert)
Is it common practice, when inserting more than one row of data, to iterate over the rows and insert them one by one?
Every article or example I see is about inserting a single row into a DB.
In Python you can simply loop over the rows of your dataframe and do your inserts:
for record in dataframe.itertuples(index=False):
    # itertuples yields one row at a time; iterating the dataframe directly
    # would give you the column names instead.
    sql = '''INSERT INTO table (col1, col2, col3)
             VALUES ('{}', '{}', '{}')
          '''.format(record[0], record[1], record[2])
    dbo.execute(sql)
This is highly simplistic. You may want to use something like SQLAlchemy, and make sure you use prepared statements. Never overlook security.
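If you want a single round trip instead of one INSERT per row, psycopg2's execute_values helper can send the whole batch at once. A minimal sketch, assuming a psycopg2 connection conn, a table named my_table (a stand-in for your real table), and a dataframe with ID, MODEL and PRICE columns as in the question's example:

from psycopg2.extras import execute_values

# Turn the dataframe rows into plain tuples once, then insert them in one call.
rows = list(dataframe[["ID", "MODEL", "PRICE"]].itertuples(index=False, name=None))

with conn.cursor() as cursor:
    execute_values(
        cursor,
        "INSERT INTO my_table (ID, MODEL, PRICE) VALUES %s",
        rows,
    )
conn.commit()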

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. Unfortunately (and I can't change this) the table does not have any primary key other than the built-in Oracle ROWID (which isn't a real table column; it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to. I am creating a dataframe of rows I would like to have removed from the SQL table.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of delete statements is very efficient or pythonic, but I am also concerned about building a single SQL delete statement keyed off ROWID that contains a clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?
Using ROWID
Since you can use ROWID, that would be the ideal way to do it. Depending on the Oracle version, the query length limit may be large enough for a query with that many elements in the IN clause; the real issue is the number of elements in the IN expression list, which is limited to 1,000.
So you'll either have to break the list of ROWIDs into sets of 1,000 at a time, or delete just a single row at a time, with or without executemany().
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
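A rough sketch of the chunked variant with bind variables, assuming delrows and a cx_Oracle cursor as above (the :1, :2, ... placeholders are positional binds):

CHUNK = 1000  # stay under Oracle's 1000-element IN-list limit

for start in range(0, len(delrows), CHUNK):
    chunk = delrows[start:start + CHUNK]
    placeholders = ", ".join(f":{i + 1}" for i in range(len(chunk)))
    cursor.execute(
        f"DELETE FROM sometable WHERE ROWID IN ({placeholders})",
        chunk,
    )
cursor.connection.commit()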
Without ROWID
Without a primary key or ROWID, the only way to identify each row is to specify all the columns in the WHERE clause, and to delete many rows at a time they'll need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see it's not the nicest query to construct but allows you to do it without ROWIDs.
And in both cases, you probably won't be using parameterised queries, since the IN list in option 1 and the OR grouping in option 2 are variable in length. (Yes, you could parameterise it after constructing the whole extended SQL with thousands of placeholders; I'm not sure what the limit on that is.) The executemany() route is definitely easier to write and run, but for speed, the single large queries (either of the above two) will probably outperform executemany with thousands of items.
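For completeness, a sketch of the parameterised version of option 2, assuming rows_to_delete is a list of (col1, col2, col3) tuples and a cx_Oracle cursor as above:

# Build one DELETE with an OR'd group per row, binding the values
# positionally (:1, :2, ...) instead of inlining them into the SQL.
binds, groups = [], []
for n, (v1, v2, v3) in enumerate(rows_to_delete):
    base = n * 3
    groups.append(f"(col1 = :{base + 1} AND col2 = :{base + 2} AND col3 = :{base + 3})")
    binds.extend([v1, v2, v3])

sql = "DELETE FROM sometable WHERE " + " OR ".join(groups)
cursor.execute(sql, binds)
cursor.connection.commit()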
You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

Update table in sql server, with values in pandas dataframe

I have a table in SQL Server:
id count
1 1
2 1
3 1
4 1
5 1
I have another table in a pandas dataframe (df), with an updated count:
id count
1 1
2 1
3 2
4 3
5 4
I want to make changes in my database using an UPDATE query, and I am thinking of defining a function which would do this.
I am using pypyodbc for my connection:
conn = pypyodbc.connect("Driver={SQL Server};Server=<YourServer>;Database=<YourDatabase>;uid=<YourUserName>;pwd=<YourPassword>")
I tried using:
for row in df.iterrows():
    updateQuery = "update "+db_table+" set count="+str(row[1][1])+" where id= '"+str(row[1][0])+"'"
    cursor.execute(updateQuery)
    conn.commit()
But is there any better way of doing this?
What you are trying to do in your question is an iterative update, looping through each row one by one. SQL databases are very inefficient at this kind of operation, but they are very efficient at set-based updates.
What this means in practice is: write a statement that applies to the whole table in one go, and run it once.
In your case you can demonstrate this with table variables in a query within SQL Server Management Studio (SSMS).
First, create your test data:
declare @t table(id int, [count] int);
insert into @t values(1,1),(2,1),(3,1),(4,1),(5,1);
declare @p table(id int, [count] int);
insert into @p values(1,1),(2,1),(3,2),(4,3),(5,4);
The first two select statements show you what the data looks like before your update:
select *
from @t;
select *
from @p;
To update all your rows of data in your SQL Table, you can join it to the data that was in your Pandas table, and then select the data again so you can see the change that took place in the update:
update t
set [count] = p.[count]
from @t as t
join @p as p
    on (t.id = p.id);
select *
from @t;
select *
from @p;
I am not too familiar with pandas DataFrames, so I do not know exactly how you access and query this data. If you are working with large datasets, I would recommend importing the pandas data as-is into a staging table in your SQL Server database and running the above type of query within SQL Server.
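A rough sketch of that staging-table approach from Python, assuming a SQLAlchemy engine pointing at the same SQL Server database, a hypothetical staging table name count_staging, and db_table holding the real table name as in the question:

from sqlalchemy import create_engine, text

# Placeholder connection string; adjust driver/server/credentials to your setup.
engine = create_engine(
    "mssql+pyodbc://<YourUserName>:<YourPassword>@<YourServer>/<YourDatabase>?driver=ODBC+Driver+17+for+SQL+Server"
)

with engine.begin() as cn:
    # 1. Push the dataframe into a staging table.
    df.to_sql("count_staging", cn, if_exists="replace", index=False)
    # 2. One set-based UPDATE joined on id, mirroring the SSMS example above.
    cn.execute(text(f"""
        UPDATE t
        SET t.[count] = s.[count]
        FROM {db_table} AS t
        JOIN count_staging AS s ON s.id = t.id
    """))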

SQL delete duplicate rows [duplicate]

Here is my table structure:
"Author" (varchar) | "Points" (integer) | "Body" (text)
Within a set of duplicates, the Author and the Body are the same. The same author appears multiple times throughout the database with different bodies, so I cannot delete based on the author alone. The Points column isn't always the same, however; I want to keep the row with the largest Points value.
I am using SQLite3 and Python.
Thanks
EDIT:
I have tried this, but it just deletes all the rows.
for row in cur.fetchall():
    rows = cur.execute('SELECT * FROM Posts WHERE Author=? AND Body=? AND Nested=? AND Found=?',
                       (row['Author'], row['Body'], row['Nested'], row['Found'],))
    for row2 in rows:
        delrow = row
        if (row['Upvotes'] < row2['Upvotes'] or row['Downvotes'] < row2['Downvotes']):
            delrow = row2
        cur.execute("DELETE FROM Posts WHERE Author=? AND Body=? AND Upvotes=? AND Downvotes=? AND Nested=? AND Found=?",
                    (delrow['Author'], delrow['Body'], delrow['Upvotes'], delrow['Downvotes'], delrow['Nested'], delrow['Found'],))
        dn += 1
        print "Deleted row ", dn
I have also tried this, but it didn't work.
cur.execute("DELETE FROM Posts WHERE Upvotes NOT IN (SELECT MAX(Upvotes) FROM Posts GROUP BY Body);")
I am also committing all the changes, so it is not that. The sqlite3 module is installed correctly and I can write to the db.
Unfortunately, SQLite3 doesn't have nice window functions like ROW_NUMBER() OVER (PARTITION BY ...), so there's no easy way to do it in one query; you'll have to do it procedurally or iteratively.
For performance reasons I would recommend extracting the full list of deletion candidates and then deleting them en masse, e.g.:
# in your sql query
SELECT ROWID, AUTHOR, BODY
FROM TABLE_NAME
ORDER BY AUTHOR, BODY, POINTS DESC
Then, in your Python application, iterate through the result set and store every ROWID that is not the first one for its Author/Body combination (think control-break style programming); once you're done building that set, delete those row IDs.
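A minimal sketch of that control-break approach, assuming the Posts table from the question and a database file name of your choosing:

import sqlite3

conn = sqlite3.connect("posts.db")  # hypothetical file name
cur = conn.cursor()

# Highest-Points row comes first within each Author/Body group.
cur.execute("""
    SELECT ROWID, Author, Body
    FROM Posts
    ORDER BY Author, Body, Points DESC
""")

to_delete = []
prev_key = None
for rowid, author, body in cur.fetchall():
    key = (author, body)
    if key == prev_key:
        # Not the first (highest-Points) row for this Author/Body: mark it.
        to_delete.append((rowid,))
    prev_key = key

cur.executemany("DELETE FROM Posts WHERE ROWID = ?", to_delete)
conn.commit()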
Since you want to delete all but the highest points value, the following will do it just fine:
delete from test
where exists (select * from test t2
where test.author = t2.author
and test.body = t2.body
and test.points < t2.points);
It's a basic join to itself, and then deleting out all values that have the same author & body, but have a lower point value.
SqlFiddle here: http://sqlfiddle.com/#!7/64d62/3
Note: the one caveat is that if several rows for the same author/body pair share the same maximum point value, all of those rows will be preserved.
I haven't tested it, but this may work:
DELETE FROM TableName
WHERE (author, body, points) NOT IN (SELECT author, body, MAX(points) as points
                                     FROM TableName
                                     GROUP BY author, body)
Run it as a SELECT query first to see if it will keep what you want.
