How to Bulk Insert SQLAlchemy subclasses (Joined Inheritance) - python

I am trying to bulk insert SQL-Alchemy Subclasses into the parent table and their respective tables ie fruits tables -> Apple Table and so I insert a table of APPLE and it will insert both the row into the fruits table then give me the id of the row in fruits table and put it Apple
This works when inserting one row at a time, but I need it to work with bulk insertion due to performance
I have tried to bulk insert which failed and I tried single row insertion it works with single row insertion but the thing is this data is not really unique except for the id of the row which is auto-generated so its going to be really hard to do a bulk insert to the parent table then do a bulk insert into the subclass table where the data matches and use the id by a mapping function
for data in apple_list:
db.session.add(Apple(
brand=data["brand"],
picked_date=data["picked_date"],
type=data["type"],
color=data["color"],
sub_type=data["sub_type"],
))
what I want is something more like bulk insertion
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list)
Actual results are that when it tries to insert it errors out on the insertion due to it not having the foreign primary key that tells the row for the fruits table to the apple table
Expect to insert without any errors and to populate both tables like when inserting both rows

I figured it out so SQLALCHEMY has a parameter in bulk insert mapping called return_defaults
WARNING: This is straight from the docs return_defaults – when True, rows that are missing values which generate defaults, namely integer primary key defaults and sequences, will be inserted one at a time, so that the primary key value is available. In particular this will allow joined-inheritance and other multi-table mappings to insert correctly without the need to provide primary key values ahead of time; however, Session.bulk_insert_mappings.return_defaults greatly reduces the performance gains of the method overall. If the rows to be inserted only refer to a single table, then there is no reason this flag should be set as the returned default information is not used.
so then all you have to do is this
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list, return_defaults=True)
it's still alot faster than db.session.add

Related

How to get the data object of a newly inserted data row and flask-mysqldb?

I have work in Perl where I am able to get the newly created data object ID by passing the result back to a variable. For example:
my $data_obj = $schema->resultset('PersonTable')->create(\%psw_rec_hash);
Where the $data_obj contains the primary key's column value.
I want to be able to do the same thing using Python 3.7, Flask and flask-mysqldb,
but without having to do another query. I want to be able to use the specific
record's primary key column value for another method.
Python and flask-mysqldb inserts data like so:
query = "INSERT INTO PersonTable (fname, mname, lname) VALUES('Phil','','Vil')
cursor = db.connection.cursor()
cursor.execute(query)
db.connection.commit()
cursor.close()
The PersonTable has a primary key column called, id. So, the newly inserted data row would look
like:
23, 'Phil', 'Vil'
Because there are 22 rows of data before the last inserted data, I don't want to perform a search
for the data, because there could be more than one entry with the same data. However, all I want
the most recent data row.
Can I do something similar to Perl with python 3.7 and flask-mysqldb?
You may want to consider the Flask-SQLAlchemy package to help you with this.
Although the syntax is going to be slightly different from Perl, what you can do is, when you create the model object, you can set it to a variable. Then, when you either flush or commit on the Database session, you can pull up your primary key attribute on that model object you had created (whether it's "id" or something else), and use it as needed.
SQLAlchemy supports MySQL, as well as several other relational databases. In addition, it is able to help prevent SQL injection attacks so long as you use model objects and add/delete them to your database session, as opposed to straight SQL commands.

Best way to perform bulk insert SQLAlchemy

I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

SQLalchemy "load_only" doesn't load only the columns specified

I am trying to select a subset of columns from a table with sqlalchemy's load_only function. Unfortunately it doesn't seem to return only the columns specified in the functional call - specifically, it also seems to fetch the primary key (in my case, an auto_increment id field).
A simple example, if I use this statement to build a query,:
query = session.query(table).options(load_only('col_1', 'col_2'))
Then the query.statement looks like this:
SELECT "table".id, "table"."col_1", "table"."col_2"
FROM "table"
Which is not what I would have expected - given I've specified the "only" columns to use...Where did the id come from - and is there a way to remove it?
Deferring the primary key would not make sense, if querying complete ORM entities, because an entity must have an identity so that a unique row can be identified in the database table. So the query includes the primary key though you have your load_only(). If you want the data only, you should query for that specifically:
session.query(table.col1, table.col2).all()
The results are keyed tuples that you can treat like you would the entities in many cases.
There actually was an issue where having load_only() did remove the primary key from the select list, and it was fixed in 0.9.5:
[orm] [bug] Modified the behavior of orm.load_only() such that primary key columns are always added to the list of columns to be “undeferred”; otherwise, the ORM can’t load the row’s identity. Apparently, one can defer the mapped primary keys and the ORM will fail, that hasn’t been changed. But as load_only is essentially saying “defer all but X”, it’s more critical that PK cols not be part of this deferral.

I am delete object of model with pk=1, but new object have pk=2 [duplicate]

I have got a table with auto increment primary key. This table is meant to store millions of records and I don't need to delete anything for now. The problem is, when new rows are getting inserted, because of some error, the auto increment key is leaving some gaps in the auto increment ids.. For example, after 5, the next id is 8, leaving the gap of 6 and 7. Result of this is when I count the rows, it results 28000, but the max id is 58000. What can be the reason? I am not deleting anything. And how can I fix this issue.
P.S. I am using insert ignore while inserting records so that it doesn't give error when I try to insert duplicate entry in unique column.
This is by design and will always happen.
Why?
Let's take 2 overlapping transaction that are doing INSERTs
Transaction 1 does an INSERT, gets the value (let's say 42), does more work
Transaction 2 does an INSERT, gets the value 43, does more work
Then
Transaction 1 fails. Rolls back. 42 stays unused
Transaction 2 completes with 43
If consecutive values were guaranteed, every transaction would have to happen one after the other. Not very scalable.
Also see Do Inserted Records Always Receive Contiguous Identity Values (SQL Server but same principle applies)
You can create a trigger to handle the auto increment as:
CREATE DEFINER=`root`#`localhost` TRIGGER `mytable_before_insert` BEFORE INSERT ON `mytable` FOR EACH ROW
BEGIN
SET NEW.id = (SELECT IFNULL(MAX(id), 0) + 1 FROM mytable);;
END
This is a problem in the InnoDB, the storage engine of MySQL.
It really isn't a problem as when you check the docs on “AUTO_INCREMENT Handling in InnoDB” it basically says InnoDB uses a special table to do the auto increments at startup
And the query it uses is something like
SELECT MAX(ai_col) FROM t FOR UPDATE;
This improves concurrency without really having an affect on your data.
To not have this use MyISAM instead of InnoDB as storage engine
Perhaps (I haven't tested this) a solution is to set innodb_autoinc_lock_mode to 0.
According to http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html this might make things a bit slower (if you perform inserts of multiple rows in a single query) but should remove gaps.
You can try insert like :
insert ignore into table select (select max(id)+1 from table), "value1", "value2" ;
This will try
insert new data with last unused id (not autoincrement)
if in unique fields duplicate entry found ignore it
else insert new data normally
( but this method not support to update fields if duplicate entry found )

SQLite get id and insert when using executemany

I am optimising my code, and reducing the amount of queries. These used to be in a loop but I am trying to restructure my code to be done like this. How do I get the second query working so that it uses the id entered in the first query from each row. Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be a for loop but were taking too long so I am trying to avoid using a for loop and doing the query individually. I believe their might be a way with combining it into one query but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
Immediately before the first statement, determine the highest ROWID in use in the table.
oldmax ← Execute("SELECT max(ROWID) from nodes").
Perform the first insert as before.
Read back the row IDs that were actually assigned with a select statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax) .
Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.

Categories