Bulk insert with returning IDs performance

I have a table into which I want to insert around 1000 rows per query and get their primary keys back after creation, for later use as foreign keys in other tables.
I've tried inserting them using the RETURNING syntax in PostgreSQL, but it takes around 10 seconds to insert:
INSERT INTO table_name (col1, col2, col3) VALUES (a1,a2,a3)....(a(n-2),a(n-1),a(n)) returning id;
Without RETURNING I get much better performance, around 50 ms.
I think that if I could atomically get the first ID and insert the rows in the same operation, I could drop RETURNING and get better performance,
but I don't know whether that is possible.

Generate the IDs up front with nextval, so that they are already known and RETURNING is not needed:
http://www.postgresql.org/docs/9.1/static/sql-createsequence.html
CREATE TEMP TABLE temp_val AS (
    VALUES (nextval('table_name_id_seq'), a1, a2, a3),
           (nextval('table_name_id_seq'), a1, a2, a3)
);
INSERT INTO table_name (id, col1, col2, col3) (
    SELECT column1, column2, column3, column4
    FROM temp_val
);
SELECT column1 FROM temp_val;
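From Python, the three statements of the temp-table approach could be generated for any number of rows. The following is only a minimal sketch: the helper name build_preallocated_insert is hypothetical, the table, sequence and column names are taken from the example above, and the %s placeholders assume a driver that interpolates parameters on the client side (as psycopg2 does). It only builds the SQL text and the parameter list; executing them is left to the caller.

```python
def build_preallocated_insert(table, sequence, columns, rows):
    """Return (statements, params) for the temp-table approach.

    Hypothetical helper. Table, sequence and column names are interpolated
    directly into the SQL, so they must come from trusted code.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    placeholders = ", ".join(["%s"] * len(columns))
    # One VALUES tuple per row: a fresh sequence value plus the row's data.
    row_sql = f"(nextval('{sequence}'), {placeholders})"
    values_sql = ",\n    ".join([row_sql] * len(rows))
    # A bare VALUES list names its output columns column1, column2, ...
    temp_cols = ", ".join(f"column{i}" for i in range(1, len(columns) + 2))
    statements = [
        f"CREATE TEMP TABLE temp_val AS (\n    VALUES {values_sql}\n)",
        f"INSERT INTO {table} (id, {', '.join(columns)})\n"
        f"SELECT {temp_cols} FROM temp_val",
        "SELECT column1 FROM temp_val",
    ]
    # Flatten the rows into one parameter list for the first statement.
    params = [value for row in rows for value in row]
    return statements, params


statements, params = build_preallocated_insert(
    "table_name", "table_name_id_seq", ["col1", "col2", "col3"],
    [(1, 2, 3), (4, 5, 6)],
)
```

Only the first statement takes parameters; the last one returns the pre-allocated IDs in the temp table.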

Related

INSERT SQL records from one database to a second (where 2nd has an additional column)

I'd like to insert selected records from Table A into Table B (in this example the different "databases" are just different tables, so there is no need to worry about ATTACH), where Table A has fewer columns than Table B. The additional B_Table column (col3) should also be populated.
I've tried this sequence in raw SQL (through SQLAlchemy):
1.) INSERTing A_Table into B_Table using engine.connect().execute(text):
text("INSERT INTO B_Table (col1, col2) SELECT col1, col2 FROM A_Table")
2.) UPDATEing B_Table with the col3 info over an engine.connect() connection (all newly inserted records are updated with the same identifier, NewInfo):
text("UPDATE B_Table SET col3 = NewInfo WHERE B_Table.ID >= %s" % (starting_ID#_of_INSERT'd_records))
More efficient alternative?
But this is incredibly inefficient. The UPDATE of a single column takes 4x longer than the INSERT, when it seems like it should take only a fraction of the INSERT time. I'd like to reduce the total time to roughly just the insertion time.
What's a better way to copy data from one table to another without an INSERT followed by an UPDATE? I was considering:
1.) SQLAlchemy session.query(A_Table), but I wasn't sure how to then edit that object (for col3) and insert the updated object without loading all the queried A_Table data into RAM (which I understand raw SQL's INSERT does not do).

You can use 'NewInfo' as a string literal in the SELECT statement:
INSERT INTO B_Table (col1, col2, col3)
SELECT col1, col2, 'NewInfo'
FROM A_Table;
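If the constant comes from Python, it can be bound as a parameter rather than pasted into the SQL text. The sketch below uses the standard-library sqlite3 module purely as a stand-in so that it runs anywhere; the function name copy_with_constant and the table layouts are assumptions, and with another driver the placeholder style would differ (for example %s instead of ?).

```python
import sqlite3


def copy_with_constant(conn, new_info):
    """Copy A_Table into B_Table, filling col3 with one constant value."""
    # The constant is bound as a parameter, so no UPDATE pass is needed
    # and no quoting has to be done by hand.
    conn.execute(
        "INSERT INTO B_Table (col1, col2, col3) "
        "SELECT col1, col2, ? FROM A_Table",
        (new_info,),
    )
    conn.commit()


# In-memory stand-in tables (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A_Table (col1 INTEGER, col2 TEXT)")
conn.execute(
    "CREATE TABLE B_Table (ID INTEGER PRIMARY KEY, col1 INTEGER, col2 TEXT, col3 TEXT)"
)
conn.executemany("INSERT INTO A_Table VALUES (?, ?)", [(1, "a"), (2, "b")])

copy_with_constant(conn, "NewInfo")
rows = conn.execute("SELECT col1, col2, col3 FROM B_Table ORDER BY col1").fetchall()
```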

Iteratively INSERTing from a Dataframe

My question may be out of pure ignorance. Take an arbitrary dataframe of, say, 5 rows. I want to insert that dataframe into a DB (in my case PostgreSQL). General code to do that is along the lines of:
postgres_insert_query = """INSERT INTO table (ID, MODEL, PRICE) VALUES (%s, %s, %s)"""
record_to_insert = (1, 'A', 100)
cursor.execute(postgres_insert_query, record_to_insert)
Is it common practice, when inserting more than one row of data, to iterate over your rows and do that for each one?
It appears that every article or example I see is about inserting a single row into a DB.

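In Python you could simply loop over the rows of your data frame and do your inserts.
# itertuples() yields one row at a time; iterating the frame itself would yield column names
for record in dataframe.itertuples(index=False):
    # string formatting keeps the example short, but it is open to SQL injection
    sql = '''INSERT INTO table (col1, col2, col3)
    VALUES ('{}', '{}', '{}')
    '''.format(record[1], record[0], record[2])
    dbo.execute(sql)
This is highly simplistic. You may want to use something like SQLAlchemy, and make sure you use prepared statements. Never overlook security.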
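A parameterized variant can hand all rows to the driver in one call through the DB-API executemany() method. This is a minimal sketch that uses the standard-library sqlite3 module as a stand-in database; the items table, the insert_records helper and the sample rows are made up for illustration. With a PostgreSQL driver such as psycopg2 the placeholders would be %s instead of ?.

```python
import sqlite3


def insert_records(conn, records):
    """Insert every (id, model, price) tuple with one parameterized statement."""
    conn.executemany(
        "INSERT INTO items (id, model, price) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, model TEXT, price INTEGER)")

# Plain tuples stand in for the dataframe rows here.
records = [(1, "A", 100), (2, "B", 250), (3, "C", 75)]
insert_records(conn, records)
inserted = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

With pandas, the list of tuples could come from list(dataframe.itertuples(index=False, name=None)), assuming the column order matches the INSERT.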

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to hold the row data. I then create a dataframe of the rows I would like to have removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary key other than the built-in Oracle ROWID (which isn't a real table column, it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of DELETE statements is very efficient or Pythonic. However, I am also concerned about building a single SQL DELETE statement keyed off ROWID that contains a clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?

Using ROWID
Since you can use ROWID, that would be the ideal way to do it. Depending on the Oracle version, the query length limit may be large enough for a query with that many elements in the IN clause. The real issue is the number of elements in the IN expression list, which is limited to 1000.
So you'll either have to break up the list of ROWIDs into sets of 1000 at a time or delete a single row at a time, with or without executemany().
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
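Splitting the ROWIDs into sets of at most 1000 could be sketched as follows. The helper name chunked_delete_statements and the sample ROWIDs are made up; the helper only builds the SQL text with numbered bind placeholders (:1, :2, ...) together with the matching parameter list, on the assumption that each pair is then passed to cursor.execute(sql, params).

```python
def chunked_delete_statements(rowids, table="sometable", chunk_size=1000):
    """Yield (sql, params) pairs, each deleting at most chunk_size rows.

    Hypothetical helper; the table name is interpolated and must be trusted.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(rowids), chunk_size):
        chunk = rowids[start:start + chunk_size]
        # One numbered placeholder per ROWID keeps the values out of the SQL text.
        placeholders = ",".join(f":{i}" for i in range(1, len(chunk) + 1))
        sql = f"DELETE FROM {table} WHERE ROWID IN ({placeholders})"
        yield sql, chunk


delrows = [f"rowid{i}" for i in range(2500)]  # made-up ROWIDs
batches = list(chunked_delete_statements(delrows))
```

Each batch stays within the 1000-element limit of the IN list, so 2500 ROWIDs become three statements.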
Without ROWID
Without a primary key or ROWID, the only way to identify each row is to specify all the columns in the WHERE clause. To delete many rows at a time, the per-row conditions need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see, it's not the nicest query to construct, but it allows you to do it without ROWIDs.
In both cases, you probably don't need to use parameterised queries, since the IN list in the first approach or the OR grouping in the second is variable in length. (Yes, you could parameterise it after constructing the whole extended SQL with thousands of parameters; I'm not sure what the limit is on that.) The executemany() way is definitely easier to write, but for speed, a single large query (either of the above two) will probably outperform executemany() with thousands of items.

You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

What's the fastest way to see if a table has no rows in postgreSQL?

I have a bunch of tables that I'm iterating through, and some of them have no rows (i.e. just a table of headers with no data).
ex: SELECT my_column FROM my_schema.my_table LIMIT 1 returns an empty result set.
What is the absolute fastest way to check that a table is one of these tables with no rows?
I've considered: SELECT my_column FROM my_schema.my_table LIMIT 1 or SELECT * FROM my_schema.my_table LIMIT 1,
followed by an "if result is None" check (I'm working in Python). Is there any faster way to check?

This is not faster than your solution, but it returns a boolean regardless:
select exists (select 1 from mytable)
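Wrapped in Python, the EXISTS check could look like the sketch below. It runs against an in-memory sqlite3 database purely as a stand-in (the same SELECT EXISTS statement is valid in PostgreSQL); the table_is_empty helper and the two sample tables are made up for illustration.

```python
import sqlite3


def table_is_empty(conn, table):
    """Return True when the table has no rows.

    The table name is interpolated into the SQL, so it must be trusted.
    """
    # EXISTS always yields exactly one row, so fetchone() never returns None.
    has_rows = conn.execute(f"SELECT EXISTS (SELECT 1 FROM {table})").fetchone()[0]
    return not has_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empty_table (my_column INTEGER)")
conn.execute("CREATE TABLE full_table (my_column INTEGER)")
conn.execute("INSERT INTO full_table VALUES (1)")

results = {name: table_is_empty(conn, name) for name in ("empty_table", "full_table")}
```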

select exists (select * from myTab);
or
select 1 where exists (select * from myTab)
or even
SELECT reltuples FROM pg_class WHERE oid = 'schema_name.table_name'::regclass;
The third example reads the planner's row estimate from pg_class, which may not be 100% accurate (it is only refreshed by operations such as VACUUM and ANALYZE), but may be slightly faster.

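Try this code:
SELECT COUNT(*) FROM table_name;
Note that COUNT(*) always returns exactly one row, so a trailing LIMIT 1 has no effect, and the count has to scan the whole table, which makes it slower than the EXISTS checks on large tables.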

select all id's and insert missing data

I have a database where I store some values with an auto-generated index key. I also have an n:m mapping table like this:
create table data(id int not null identity(1,1), col1 int not null, col2 varchar(256) not null);
create table otherdata(id int not null identity(1,1), value varchar(256) not null);
create table data_map(dataid int not null, otherdataid int not null);
Every day the data table needs to be updated with a list of new values, many of which are already present but still need to be inserted into data_map (the key in otherdata is generated at that point, so the data in that table is always new).
One way of doing it would be to first try to insert all values, then select the generated IDs, then insert into data_map:
mydata = [] # list of tuples
cursor.executemany("if not exists (select * from data where col1 = %d and col2 = %d) insert into data (col1, col2) values (%d, %d)", mydata);
# now select the id's
# [...]
But that is obviously quite bad, because I need to select everything without using the key, and I also need to do the existence check without using the key, so I need indexed data first, otherwise everything is very slow.
My next approach was to use a hash function (like MD5 or CRC64) to generate my own hash over col1 and col2, so that I can insert all values without using a SELECT and can use the indexed key when inserting missing values.
Can this be optimized, or is it the best thing I could do?
The number of rows is >500k per change, of which maybe ~20-50% will already be in the database.
Timing-wise, it looks like calculating the hashes is much faster than inserting data into the database.

As far as I can tell, you are using mysql.connector. If so, you should not use %d placeholders when you run cursor.execute(). Every placeholder should be just %s, and the connector will take care of the type conversion.
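The hashing idea from the question could be sketched in Python as follows. Everything here is an assumption made for illustration: the helper names row_key and split_new_rows, the choice of MD5 from hashlib, and the premise that the hashes already stored in the database have been fetched into a Python set beforehand.

```python
import hashlib


def row_key(col1, col2):
    """Return an MD5 hex digest identifying a (col1, col2) pair."""
    # col1 is an integer, so the separator cannot occur inside it;
    # this keeps pairs such as (1, "23") and (12, "3") distinct.
    payload = f"{col1}\x1f{col2}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def split_new_rows(incoming, existing_keys):
    """Split incoming (col1, col2) tuples into new rows and already-known rows."""
    new_rows, known_rows = [], []
    for row in incoming:
        target = known_rows if row_key(*row) in existing_keys else new_rows
        target.append(row)
    return new_rows, known_rows


# existing_keys stands in for the hashes already stored in the database.
existing_keys = {row_key(1, "alpha")}
new_rows, known_rows = split_new_rows([(1, "alpha"), (2, "beta")], existing_keys)
```

Only new_rows would then need an INSERT into data, while both lists still need their IDs for data_map.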
