I am trying to select a subset of columns from a table with SQLAlchemy's load_only function. Unfortunately it doesn't seem to return only the columns specified in the function call - specifically, it also seems to fetch the primary key (in my case, an auto_increment id field).
A simple example: if I use this statement to build a query:
query = session.query(table).options(load_only('col_1', 'col_2'))
Then the query.statement looks like this:
SELECT "table".id, "table"."col_1", "table"."col_2"
FROM "table"
Which is not what I would have expected, given I've specified the "only" columns to use. Where did the id come from, and is there a way to remove it?
Deferring the primary key would not make sense when querying complete ORM entities, because an entity must have an identity so that a unique row can be identified in the database table. So the query includes the primary key even though you have load_only(). If you want the data only, you should query for that specifically:
session.query(table.col_1, table.col_2).all()
The results are keyed tuples that you can treat like you would the entities in many cases.
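For example (a minimal sketch, assuming table is the mapped class from the question, with columns col_1 and col_2):

rows = session.query(table.col_1, table.col_2).all()

for row in rows:
    # keyed-tuple results support both attribute access and indexing
    print(row.col_1, row.col_2)
    print(row[0], row[1])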
There actually was an issue where having load_only() did remove the primary key from the select list, and it was fixed in 0.9.5:
[orm] [bug] Modified the behavior of orm.load_only() such that primary key columns are always added to the list of columns to be “undeferred”; otherwise, the ORM can’t load the row’s identity. Apparently, one can defer the mapped primary keys and the ORM will fail, that hasn’t been changed. But as load_only is essentially saying “defer all but X”, it’s more critical that PK cols not be part of this deferral.
Most databases allow defining UNIQUE key (unique field) that is not a PRIMARY KEY, but DynamoDB does not seem to support unique key definitions.
For example, a model SampleModel defines an id field as a PRIMARY KEY (id = UnicodeAttribute(hash_key=True)). What if another field (let's say name) must also be defined as unique? Given that DynamoDB does not offer unique field specification, and only one PK (hash_key=True) is allowed - how can name be defined as UNIQUE?
As you've already seen, there's no direct support for that in DynamoDB.
If this name attribute is immutable after you write it, you could do the following:
You can create a second DynamoDB table that only has a Partition Key (Partition Key = Primary Key) and stores each name there. When you add an item to the first table, you use a transaction and have a separate insert into the second table that has a condition of the key not already existing. If this transaction fails, you were trying to put an item whose name already existed in the table. If the transaction goes through, you've inserted a new unique name.
This will incur extra cost for the second table and transactions and only works in this narrow use case.
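To make the idea concrete, here is a rough sketch with boto3; the table names samples and sample_names are made up for illustration, and PynamoDB's own transaction API could be used instead:

import boto3

dynamodb = boto3.client("dynamodb")

def put_sample_with_unique_name(item_id, name):
    try:
        dynamodb.transact_write_items(
            TransactItems=[
                {
                    # Reserve the name in the lookup table; the condition fails
                    # (cancelling the whole transaction) if the name is taken.
                    "Put": {
                        "TableName": "sample_names",
                        "Item": {"name": {"S": name}},
                        "ConditionExpression": "attribute_not_exists(#n)",
                        "ExpressionAttributeNames": {"#n": "name"},
                    }
                },
                {
                    # Write the actual item to the main table.
                    "Put": {
                        "TableName": "samples",
                        "Item": {"id": {"S": item_id}, "name": {"S": name}},
                    }
                },
            ]
        )
    except dynamodb.exceptions.TransactionCanceledException:
        raise ValueError(f"name {name!r} already exists")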
DynamoDB is not built for this pattern; you have to enforce uniqueness another way. You create a PK out of whatever value has to be unique (in the same table or a second table) and use transactions to enforce it on insert.
https://aws.amazon.com/blogs/database/simulating-amazon-dynamodb-unique-constraints-using-transactions/
I have a table called products
which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is copy a bulk of existing products, update their activity_id, and create new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2.
Now I want to create another 70 entries with the same data, except for an updated activity_id.
I could have thousands of existing entries that I'd like to make a copy of, updating the copied entries' activity_id to a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, clone the existing products, and just update the activity_id field.
for product in products:
    product.activity_id = new_id

self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/fastest way to do these bulk entries so it saves time? As of now the performance is OK, but is there a better solution?
You don't need to load these objects locally; all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO products (product_id, data, activity_id)
SELECT product_id, data, 2  -- the new activity_id value
FROM products
WHERE activity_id = old_id  -- the activity_id you are copying from
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
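If the plain SQL is all you need, a minimal sketch executing it directly with SQLAlchemy's text() construct (new_activity_id and old_id stand in for your own values):

from sqlalchemy import text

self.session.execute(
    text(
        "INSERT INTO products (product_id, data, activity_id) "
        "SELECT product_id, data, :new_activity_id "
        "FROM products WHERE activity_id = :old_id"
    ),
    {"new_activity_id": new_activity_id, "old_id": old_id},
)
self.session.commit()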
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to:
1. Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method. (You could also get the same object from the models.Product query; more on that later.)
2. Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
3. Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
4. Apply that updated statement to the Insert object for the table via Insert.from_select().
5. Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement with the column swapped out. You can't easily remove a column from a select statement, but you can loop over the existing columns in the query to 'copy' them across to the new SELECT and make the replacement at the same time.
Step 4 is then straightforward: Insert.from_select() needs the columns to insert and the SELECT query, and we have both, since the SELECT object also gives us its columns.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement


def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table
    # and the underlying core select statement to source new rows from
    select = query.statement

    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"

    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])

    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)
I am trying to bulk insert SQLAlchemy subclasses into the parent table and their respective tables, i.e. a fruits table -> an Apple table. So when I insert an Apple, it inserts a row into the fruits table, gives me the id of that row in the fruits table, and puts it in the Apple row.
This works when inserting one row at a time, but I need it to work with bulk insertion for performance reasons.
I have tried bulk insertion, which failed, and single-row insertion, which works. But the thing is that this data is not really unique except for the id of the row, which is auto-generated, so it's going to be really hard to do a bulk insert into the parent table, then do a bulk insert into the subclass table where the data matches and map the ids across.
for data in apple_list:
    db.session.add(Apple(
        brand=data["brand"],
        picked_date=data["picked_date"],
        type=data["type"],
        color=data["color"],
        sub_type=data["sub_type"],
    ))
What I want is something more like a bulk insertion:
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list)
The actual result is that the insertion errors out because it does not have the foreign/primary key that links the row in the fruits table to the apple table.
I expect it to insert without any errors and to populate both tables, just as when inserting rows one at a time.
I figured it out: SQLAlchemy has a parameter in bulk_insert_mappings() called return_defaults.
WARNING: This is straight from the docs: return_defaults – when True, rows that are missing values which generate defaults, namely integer primary key defaults and sequences, will be inserted one at a time, so that the primary key value is available. In particular this will allow joined-inheritance and other multi-table mappings to insert correctly without the need to provide primary key values ahead of time; however, Session.bulk_insert_mappings.return_defaults greatly reduces the performance gains of the method overall. If the rows to be inserted only refer to a single table, then there is no reason this flag should be set as the returned default information is not used.
So then all you have to do is this:
db.session.bulk_insert_mappings(model_classes['Apple'], apple_list, return_defaults=True)
It's still a lot faster than db.session.add().
I have an object which has foreign keys to another object. Let's call the first object A and the second object B. A can be represented as (integer id, integer b_id, integer some_data) and B can be represented as (integer id, integer datum_one, integer datum_two)
datum_one and datum_two form a unique composite in B (I think this is the right term - I don't want more than one entry in B with the same combination of these fields). However, I just want to reference the id in the ForeignKey pointing from A to B. It seems tedious and redundant to do something like a composite foreign key like here.
I want to have functionality such that when I add A to my database, it first searches for a matching B which has the same datum_one and datum_two. If this exists, it uses the matching entry in the database, and otherwise it adds a new row to the table representing B.
Is there a way to accomplish this? I suspect the solution may have something to do with the merge directive, but I'm not sure how to get this functionality exactly.
One potential solution I considered was actually using the UNIQUE directive, but it seems like SQLAlchemy doesn't play nice with unique - I would basically need to just write my own error handling, which is an option but I was hoping SQLAlchemy would have a more elegant solution to this.
I think you should just handle this yourself before inserting your object into the database. From the docs... http://docs.sqlalchemy.org/en/latest/core/events.html
from sqlalchemy import event

# standard decorator style
@event.listens_for(SomeSchemaClassOrObject, 'before_create')
def receive_before_create(target, connection, **kw):
    "listen for the 'before_create' event"
    # ... (event handling logic) ...
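For example, "handling it yourself before inserting" could look like a small get-or-create helper. This is only a sketch, assuming declarative models A and B as described in the question; it is not safe against concurrent inserts unless datum_one/datum_two are also backed by a UNIQUE constraint in the database:

def get_or_create_b(session, datum_one, datum_two):
    # Reuse an existing B with the same composite values, or create one.
    b = (
        session.query(B)
        .filter_by(datum_one=datum_one, datum_two=datum_two)
        .one_or_none()
    )
    if b is None:
        b = B(datum_one=datum_one, datum_two=datum_two)
        session.add(b)
        session.flush()  # populate b.id so it can be used as A.b_id
    return b

b = get_or_create_b(session, 1, 2)
session.add(A(b_id=b.id, some_data=42))
session.commit()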
I have a table with a unique constraint on a pair of columns:
CREATE TABLE entity (
    id INT NOT NULL AUTO_INCREMENT,
    zip_code INT NOT NULL,
    entity_url VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY ix_uniq_zip_code_entity_url (zip_code, entity_url)
);
and a corresponding SQLAlchemy model. I am adding a lot of records and do not want to commit the session after each one. My assumption is that it is better to call session.add(new_record) multiple times and session.commit() once.
But while adding new records I can get an IntegrityError because the constraint is violated. This is a normal situation and I just want to skip inserting such records. But it looks like I can only roll back the entire transaction.
Also, I do not want to add more complex checks like "get all records from the database where zip_code in [...] and entity_url in [...], then drop the matched data from records_to_insert".
Is there a way to instruct SQLAlchemy to drop records which violate the constraint?
My assumption that better to call session.add(new_record) multiple times and one time session.commit().
You might want to revisit this assumption. Batch processing of a lot of records usually lends itself to multiple commits -- what if you have 10k records and your code raises an exception on the 9,999th? You'll be forced to start over. The core question here is whether or not it makes sense for one of the records to exist in the database without the rest. If it does, then there's no problem committing on each entry (performance issues aside). In that case, you can simply catch the IntegrityError and call session.rollback() to continue down the list of records.
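That commit-per-record approach could look roughly like this (a sketch, assuming an Entity model mapped to the table above and a records_to_insert list of dicts):

from sqlalchemy.exc import IntegrityError

for record in records_to_insert:
    session.add(Entity(zip_code=record["zip_code"], entity_url=record["entity_url"]))
    try:
        session.commit()
    except IntegrityError:
        # (zip_code, entity_url) already present; skip this record
        session.rollback()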
In any case, a similar question was asked on the SQLA mailing list and answered by the library's creator, Mike Bayer. He recommended removing the duplicates from the list of new records yourself, as this is easy to do with a dictionary or a set. This could be as simple as a dict comprehension:
new_entities = { (entity['zip_code'], entity['url']): entity for entity in new_entities}
(This would pick the last-seen duplicate as the one to add to the DB.)
Also note that he uses the SQLAlchemy core library to perform the inserts, rather than the ORM's session.add() method:
sess.execute(Entry.__table__.insert(), params=inserts)
This is a much faster option if you're dealing with a lot of records (like in his example, with 100,000 records).
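Putting the deduplication and the core insert together might look like this (a sketch, assuming the dicts use the column names from the table above; note it only removes duplicates within the batch itself, not rows that already exist in the database):

# dedupe on the unique key, keeping the last-seen entry for each pair
deduped = {(e["zip_code"], e["entity_url"]): e for e in new_entities}

# single multi-row INSERT via the core API
session.execute(Entity.__table__.insert(), list(deduped.values()))
session.commit()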
If you decide to insert your records row by row, you can check if it already exists before you do your insert. This may or may not be more elegant and efficient:
from sqlalchemy import exists

def record_exists(session, some_id):
    return session.query(exists().where(YourEntity.id == some_id)).scalar()

for item in items:
    if not record_exists(session, item.some_id):
        session.add(item)
        session.flush()
    else:
        print("Already exists, skipping...")

session.commit()