Bulk inserts with Flask-SQLAlchemy

I'm using Flask-SQLAlchemy to do a rather large bulk insert of 60k rows. I also have a many-to-many relationship on this table, so I can't use db.engine.execute for this. Before inserting, I need to find similar items in the database, and change the insert to an update if a duplicate item is found.
I could do this check beforehand, and then do a bulk insert via db.engine.execute, but I need the primary key of the row upon insertion.
Currently, I am doing a db.session.add() and db.session.commit() on each insert, and I get a measly 3-4 inserts per second.
I ran a profiler to see where the bottleneck is, and it seems that the db.session.commit() is taking 60% of the time.
Is there some way that would allow me to make this operation faster, perhaps by grouping commits, but which would give me primary keys back?
This is what my models look like:
class Item(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(1024), nullable=True)
    created = db.Column(db.DateTime())
    tags_relationship = db.relationship(
        'Tag', secondary=tags, backref=db.backref('items', lazy='dynamic'))
    tags = association_proxy('tags_relationship', 'text')

class Tag(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(255))
My insert operation is:
for item in items:
    if duplicate:
        update_existing_item
    else:
        x = Item()
        x.title = "string"
        x.created = datetime.datetime.utcnow()
        for tag in tags:
            if not tag_already_exists:
                y = Tag()
                y.text = "tagtext"
                x.tags_relationship.append(y)
                db.session.add(y)
                db.session.commit()
            else:
                x.tags_relationship.append(existing_tag)
        db.session.add(x)
        db.session.commit()

Perhaps you should try to db.session.flush() to send the data to the server, which means any primary keys will be generated. At the end you can db.session.commit() to actually commit the transaction.
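For instance, a rough sketch of that pattern (untested; items and the duplicate handling are placeholders from the question):

for item in items:
    x = Item()
    x.title = "string"
    x.created = datetime.datetime.utcnow()
    db.session.add(x)

    db.session.flush()  # emits the pending INSERTs inside the open transaction
    print(x.id)         # the primary key is populated now

db.session.commit()     # a single commit for the whole batch

This avoids the per-row commit (and its fsync) while still giving you each generated primary key as soon as the row is flushed.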

I use the following code to quickly read the content of a pandas DataFrame into SQLite. Note that it circumvents the ORM features of SQLAlchemy. myClass in this context is a db.Model-derived class with a __tablename__ assigned to it. As the comment in the snippet mentions, I adapted the code from the gist linked there.
l = df.to_dict('records')

# bulk save the dictionaries, circumventing the slow ORM interface
# c.f. https://gist.github.com/shrayasr/5df96d5bc287f3a2faa4
connection.engine.execute(
    myClass.__table__.insert(),
    l
)

from app import db

data = [{"attribute": "value"}, {...}, {...}, ... ]
db.engine.execute(YourModel.__table__.insert(), data)

For more information, refer to https://gist.github.com/shrayasr/5df96d5bc287f3a2faa4


Preventing duplicate child entries in an ORM relationship

Basically I have a service that reads from a spreadsheet and inserts into a database.
In SQLAlchemy I have the following relationship
class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    emails = relationship('Email', backref='customer')

class Email(Base):
    __tablename__ = 'emails'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'))
    email = Column(String)
    primary = Column(Boolean)
Is it possible for SQLAlchemy to check for a duplicate entry between a fetched resource and one created in the ORM?
For example let's say customer 123 has an email some_email, and we try to add it again:
email_object = Email(customer_id=123, email='some_email', primary=True)
cust = (
    connection.query(Customer)
    .options(joinedload(Customer.emails))
    .filter_by(id=123)
    .first()
)
cust.emails.append(email_object)
Ideally I would like SQLAlchemy to either notice that such a combination exists and merge/ignore it, or throw some kind of exception.
But instead I'm getting the following result if I print out cust.emails
[<Email(id=1, email=some_email, primary=True, customer=123>),
<Email(customer=192071, email='some_email', primary=True, customers=<Employee(id=123, name='John', emails=['some_email', 'some_email']>>)]
and doing a merge and commit just seems to add an extra identical row in the database (except for the pk).
I think maybe it has to do with the unused primary key in Emails, but that is autogenerated when committing to the DB.
Any ideas?
Lemme know if I need to clarify anything.
Setting the Email class to have two primary keys doesn't seem to stop SQLAlchemy from appending the extra email
That's correct. Using a composite primary key on (customer_id, email) does not prevent SQLAlchemy from adding a new object to the session that essentially duplicates an existing email (although it will warn you if an object with the same primary key already exists in the identity map). However, the INSERT will fail because of the duplicate PK, throwing an exception and being rolled back, so the duplicate child record never reaches the database.
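For reference, the composite-key mapping being discussed would look roughly like this sketch (an assumption on my part, since the question's model uses a single id column):

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String

class Email(Base):
    __tablename__ = 'emails'
    # composite primary key: one row per (customer, address) pair
    customer_id = Column(Integer, ForeignKey('customers.id'), primary_key=True)
    email = Column(String, primary_key=True)
    primary = Column(Boolean)

    def __repr__(self):
        return f"<Email(customer_id={self.customer_id}, email='{self.email}')>"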
If you want to check whether an email exists before trying to add it, you can either use session.get() …
with Session(engine) as session:
    # retrieve John's object
    john = (
        session.execute(select(Customer).where(Customer.name == "John"))
        .scalars()
        .one()
    )
    print(john)  # <Customer(id=123, name='John')>

    # check if email already exists using .get()
    email = session.get(Email, (john.id, "some_email"))
    if email:
        print(f"email already exists: {email}")
        # email already exists: <Email(customer_id=123, email='some_email')>
    else:
        print("email does not already exist")
… or a relationship on Customer could provide the existing emails, allowing you to search for the one you want to add:
# alternative (less efficient) method: check via relationship
e_list = [e for e in john.emails if e.email == "some_email"]
if e_list:  # list not empty
    print("email already exists")
else:
    print("email does not already exist")

(pymysql.err.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails...') in Flask

I'm aware that there are many questions about the error in the title, but I couldn't find a suitable solution. My problem is that while deleting a row using Session.delete(), it throws
sqlalchemy.exc.IntegrityError: (pymysql.err.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`transport`.`driver`, CONSTRAINT `driver_ibfk_1` FOREIGN KEY (`owner_id`) REFERENCES `truckcompany` (`id`))') [SQL: 'DELETE FROM truckcompany WHERE truckcompany.id = %(id)s'] [parameters: {'id': 4}]
Models:
class Truck_company(Base):
    __tablename__ = 'truckcompany'
    id = Column(BigInteger, primary_key=True)

class Driver(Base):
    __tablename__ = 'driver'
    id = Column(BigInteger, primary_key=True)
    owner_id = Column(BigInteger, ForeignKey('truckcompany.id'))
    owner = relationship(Truck_company)
The view with the failing delete:
@app.route('/values/deleteuser/<int:id>', methods=['POST', 'GET'])
def delete_truck(id):
    value_truckcompany = sqlsession.query(Truck_company).filter(Truck_company.id == id).first()
    if value_truckcompany:
        sqlsession.delete(value_truckcompany)
        sqlsession.commit()
    return redirect('/static/truckcompanyview')
Why
In your Driver model there's a foreign key constraint referencing Truck_company:
class Driver(Base):
    ...
    owner_id = Column(BigInteger, ForeignKey('truckcompany.id'))
You've omitted the ON DELETE action from the foreign key, so MySQL defaults to RESTRICT. There are also no SQLAlchemy ORM relationships with cascades that would delete the related drivers. So when you try to delete the truck company in the view, the DB stops you, because doing so would violate the foreign key constraint, in other words referential integrity. This is an issue with how you've modeled your DB, not with Flask etc.
What can I do
The most important thing to do, when you're creating your model, is to decide what you would like to happen when you delete a truck company with related drivers. Your options include, but are not limited to:
Deleting the drivers also.
Setting their owner_id to NULL, effectively detaching them. This is what SQLAlchemy does if an ORM relationship with its default configuration is present on the parent.
It is also a perfectly valid solution to restrict deleting parent rows with children, as you've implicitly done.
You've expressed in the comments that you'd like to remove the related drivers. A quick solution is to just manually issue a DELETE:
# WARNING: Allowing GET in a data modifying view is a terrible idea.
# Prepare yourself for when Googlebot, some other spider, or an overly
# eager browser nukes your DB.
@app.route('/values/deleteuser/<int:id>', methods=['POST', 'GET'])
def delete_truck(id):
    value_truckcompany = sqlsession.query(Truck_company).get(id)
    if value_truckcompany:
        sqlsession.query(Driver).\
            filter_by(owner=value_truckcompany).\
            delete(synchronize_session=False)
        sqlsession.delete(value_truckcompany)
        sqlsession.commit()
    return redirect('/static/truckcompanyview')
On the other hand, this fixes only this one location. If you decide that a Driver has no meaning without its Truck_company, you could alter the foreign key constraint to include ON DELETE CASCADE, and use passive deletes in the related SQLAlchemy ORM relationships:
class Truck_company(Base):
    ...
    # Remember to use passive deletes with ON DELETE CASCADE
    drivers = relationship('Driver', passive_deletes=True)

class Driver(Base):
    ...
    # Let the DB handle deleting related rows
    owner_id = Column(BigInteger,
                      ForeignKey('truckcompany.id', ondelete='CASCADE'))
Alternatively you could leave it to the SQLAlchemy ORM level cascades to remove related objects, but it seems you've had some problems with that in the past. Note that the SQLAlchemy cascades define how an operation on the parent should propagate to its children, so you define delete and optionally delete-orphan on the parent side relationship, or the one-to-many side:
class Truck_company(Base):
    ...
    # If a truck company is deleted, delete the related drivers as well
    drivers = relationship('Driver', cascade='save-update, merge, delete')
In your current model you have no relationship defined from Truck_company to Driver, so no cascades take place.
Note that modifying Driver such as:
class Driver(Base):
    ...
    owner_id = Column(BigInteger,
                      ForeignKey('truckcompany.id', ondelete='CASCADE'))
will not magically migrate the existing DB table and its constraints. If you wish to take that route, you'll have to migrate either manually or using a tool such as Alembic.
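For example, a minimal Alembic sketch of such a migration (hypothetical; it assumes a configured Alembic environment, and the constraint name driver_ibfk_1 is taken from the error message above):

# inside an Alembic migration script
from alembic import op

def upgrade():
    # drop the RESTRICTing constraint and recreate it with ON DELETE CASCADE
    op.drop_constraint('driver_ibfk_1', 'driver', type_='foreignkey')
    op.create_foreign_key('driver_ibfk_1', 'driver', 'truckcompany',
                          ['owner_id'], ['id'], ondelete='CASCADE')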

Flask-SQLAlchemy - adding a new column to a query to pass to a template

For a simple library app in Flask with Flask-SQLAlchemy, I have a Book table and a Checkout table:
class Checkout(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    checkout_date = db.Column(db.DateTime, nullable=False, default=func.now())
    return_date = db.Column(db.DateTime)
    book_id = db.Column(db.Integer, db.ForeignKey('book.id'))
    book = db.relationship('Book',
                           backref=db.backref('checkout', lazy='dynamic'))

    def __repr__(self):
        return '<Checkout Book ID=%r from %s to %s>' % (
            self.book.id, self.checkout_date, self.return_date)
I want to keep all checkout records, and I figure the normalized way to do this is to use Checkout.return_date to determine whether the book is returned. If the return date of any associated Checkout record is null, the Book is still checked out; if it has no null records, the book is "available". (Really, it should never have two null return_dates.)
The SQL for this view would be:
select book.*, min(checkout.return_date is not null) as available
from book
left join checkout
on book.id = checkout.book_id
group by book.id;
I can't figure out how to do this in SQLAlchemy without copping out and using a raw SQL string:
db.session.query(Book).from_statement('select book.*, min(checkout.return_date is not null) as available from book left join checkout on book.id = checkout.book_id group by book.id').all()
But I can't access spam_book.available: I get the error 'Book' object has no attribute 'available'
Is there a way to add a dynamic, temporary attribute on to a Book to pass to the template? Or is there a better way to go about doing this?
My end goal is to be able to do {% if book.available %} for each book, in the template.
From your description, I understand your Book-to-Checkout relationship is one-to-many, and that:
Only one book can be checked out per Checkout record
Your library could have more than one copy of a book
Different copies of a book have different, unique book_ids
Try adding a method to your Book model.
class Book(db.Model):
    # ...something you already have

    def is_available(self):
        # available when no associated checkout has a NULL return_date
        return self.checkout.filter(Checkout.return_date.is_(None)).first() is None
In the template, just use:
{% if book.is_available() %}
and that should do the trick (note the explicit call parentheses, since Jinja2 doesn't call methods automatically).
This is not tested as I don't have SQLAlchemy at hand. Please tell me if it does not work.
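If you'd rather compute availability for all books in one query, closer to the SQL in the question, here is an untested sketch using NOT EXISTS:

from sqlalchemy import and_, exists

# a book is available when it has no "open" checkout (return_date IS NULL)
open_checkout = exists().where(and_(Checkout.book_id == Book.id,
                                    Checkout.return_date.is_(None)))

rows = db.session.query(Book, (~open_checkout).label('available')).all()
for book, available in rows:
    print(book.id, available)

Each row is a (Book, available) pair, so the template loop would unpack both instead of reading book.available.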

Can I store a dictionary as the property of an object?

Using Python/Flask/SQLAlchemy/Heroku.
Want to store dictionaries of objects as properties of an object:
TO CLARIFY
class SoccerPlayer(db.Model):
    name = db.Column(db.String(80))
    goals_scored = db.Column(db.Integer())
^How can I set name and goals scored as one dictionary?
UPDATE: The user will input the name and goals_scored if that makes any difference.
Also, I am searching online for an appropriate answer, but as a noob, I haven't been able to understand/implement the stuff I find on Google for my Flask web app.
I would second the approach provided by Sean; following it, you get a properly normalized DB schema and can more easily let the RDBMS do the hard work for you. If, however, you insist on a dictionary-like structure inside your DB, I'd suggest trying out the hstore data type, which allows you to store key/value pairs as a single value in Postgres. I'm not sure whether the hstore extension is created by default in the Postgres DBs provided by Heroku; you can check with the \dx command inside psql. If no line mentions hstore, you can create it by typing CREATE EXTENSION hstore;.
Since hstore support in SQLAlchemy arrives in version 0.8, which is not released yet (but hopefully will be in the coming weeks), you need to install it from its Mercurial repository:
pip install -e hg+https://bitbucket.org/sqlalchemy/sqlalchemy#egg=SQLAlchemy
Then define your model like this:
from sqlalchemy.dialects.postgresql import HSTORE
from sqlalchemy.ext.mutable import MutableDict

class SoccerPlayer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80), nullable=False, unique=True)
    stats = db.Column(MutableDict.as_mutable(HSTORE))

# Note that hstore only allows text for both keys and values (and None for
# values only).
p1 = SoccerPlayer(name='foo', stats={'goals_scored': '42'})
db.session.add(p1)
db.session.commit()
After that you can do the usual stuff in your queries:
from sqlalchemy import func, cast

q = db.session.query(
    SoccerPlayer.name,
    func.max(cast(SoccerPlayer.stats['goals_scored'], db.Integer))
).group_by(SoccerPlayer.name).first()
Check out the HSTORE docs for more examples.
If you are storing such information in a database I would recommend another approach:
class SoccerPlayer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))
    # note: ForeignKey takes a table.column name, and Flask-SQLAlchemy
    # derives table names like 'team' and 'soccer_player' by default
    team_id = db.Column(db.Integer, db.ForeignKey('team.id'))
    stats = db.relationship("Stats", uselist=False, backref="player")

class Team(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))
    players = db.relationship("SoccerPlayer")

class Stats(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    player_id = db.Column(db.Integer, db.ForeignKey('soccer_player.id'))
    goals_scored = db.Column(db.Integer)
    assists = db.Column(db.Integer)
    # Add more stats as you see fit
With this model setup you can do crazy things like this:
from sqlalchemy import and_
from sqlalchemy.sql import func

max_goals_by_team = (
    db.session.query(Team.id.label("team_id"),
                     func.max(Stats.goals_scored).label("goals_scored"))
    .join(SoccerPlayer, SoccerPlayer.team_id == Team.id)
    .join(Stats, Stats.player_id == SoccerPlayer.id)
    .group_by(Team.id)
    .subquery()
)

players = (
    db.session.query(Team.name.label("team_name"),
                     SoccerPlayer.name.label("player_name"),
                     max_goals_by_team.c.goals_scored)
    .join(SoccerPlayer, SoccerPlayer.team_id == Team.id)
    .join(Stats, Stats.player_id == SoccerPlayer.id)
    .join(max_goals_by_team,
          and_(SoccerPlayer.team_id == max_goals_by_team.c.team_id,
               Stats.goals_scored == max_goals_by_team.c.goals_scored))
)
thus making the database do the hard work of pulling out the players with the highest goals per team, rather than doing it all in Python.
Even Django (a bigger Python web framework than Flask) doesn't support this by default, though you can install a package for it called jsonfield (https://github.com/bradjasper/django-jsonfield).
What I'm trying to tell you is that not all databases know how to store binaries, but they all know how to store strings, and Django's jsonfield is actually a string containing the JSON dump of a dictionary.
So, in short, you can do the same in Flask:
import simplejson

class SoccerPlayer(db.Model):
    _data = db.Column(db.String(1024))

    @property
    def data(self):
        return simplejson.loads(self._data)

    @data.setter
    def data(self, value):
        self._data = simplejson.dumps(value)
But beware, this way you can only assign the entire dictionary at once:

player = SoccerPlayer()
player.data = {'name': 'Popey'}
print player.data  # Will work as expected
{'name': 'Popey'}

player.data['score'] = '3'
print player.data
# Will not show the score, because the setter never runs on a key assignment
{'name': 'Popey'}
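A common workaround given that limitation is to copy, mutate, and reassign, so the setter runs again:

d = player.data        # decode through the getter
d['score'] = '3'       # mutate the plain dict
player.data = d        # reassign so the setter re-serializes it
print player.data
{'name': 'Popey', 'score': '3'}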

SQLAlchemy use of inheritance in Postgres

In an attempt to learn SQLAlchemy (and Python), I am trying to duplicate an already existing project, but am having trouble figuring out SQLAlchemy and inheritance with Postgres.
Here is an example of what our Postgres database does (obviously, this is simplified):
CREATE TABLE system (system_id SERIAL PRIMARY KEY,
                     system_name VARCHAR(24) NOT NULL);

CREATE TABLE file_entry (
    file_entry_id SERIAL,
    file_entry_msg VARCHAR(256) NOT NULL,
    file_entry_system_name VARCHAR(24) REFERENCES system(system_name) NOT NULL);

CREATE TABLE ops_file_entry (
    CONSTRAINT ops_file_entry_id_pkey PRIMARY KEY (file_entry_id),
    CONSTRAINT ops_system_name_check CHECK ((file_entry_system_name = 'ops'::bpchar))
) INHERITS (file_entry);

CREATE TABLE eng_file_entry (
    CONSTRAINT eng_file_entry_id_pkey PRIMARY KEY (file_entry_id),
    CONSTRAINT eng_system_name_check CHECK ((file_entry_system_name = 'eng'::bpchar))
) INHERITS (file_entry);

CREATE INDEX ops_file_entry_index ON ops_file_entry USING btree (file_entry_system_name);
CREATE INDEX eng_file_entry_index ON eng_file_entry USING btree (file_entry_system_name);
The inserts would then be done with a trigger, so that rows are properly routed into the child tables. Something like:
CREATE FUNCTION file_entry_insert_trigger() RETURNS "trigger"
AS $$
DECLARE
BEGIN
    IF NEW.file_entry_system_name = 'eng' THEN
        INSERT INTO eng_file_entry (file_entry_id, file_entry_msg, file_entry_system_name)
        VALUES (NEW.file_entry_id, NEW.file_entry_msg, NEW.file_entry_system_name);
    ELSEIF NEW.file_entry_system_name = 'ops' THEN
        INSERT INTO ops_file_entry (file_entry_id, file_entry_msg, file_entry_system_name)
        VALUES (NEW.file_entry_id, NEW.file_entry_msg, NEW.file_entry_system_name);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
In summary, I have a parent table with a foreign key to another table, plus two child tables, and inserts are routed based on a given value. In my example above, if file_entry_system_name is 'ops', the row goes into the ops_file_entry table; 'eng' goes into eng_file_entry. We have hundreds of child tables in our production environment, and considering the amount of data, this really speeds things up, so I would like to keep the same structure. I can query the parent, and as long as I give it the right system_name, it immediately knows which child table to look in.
My desire is to emulate this with SQLAlchemy, but I can't find any examples that go into this much detail. I've looked at the SQL that SQLAlchemy generates in examples, and I can tell it is not doing anything similar on the database side.
The best I can come up with is something like:
class System(_Base):
    __tablename__ = 'system'
    system_id = Column(Integer, Sequence('system_id_seq'), primary_key=True)
    system_name = Column(String(24), nullable=False)

    def __init__(self, name):
        self.system_name = name

class FileEntry(_Base):
    __tablename__ = 'file_entry'
    file_entry_id = Column(Integer, Sequence('file_entry_id_seq'), primary_key=True)
    file_entry_msg = Column(String(256), nullable=False)
    file_entry_system_name = Column(String(24), ForeignKey('system.system_name'),
                                    nullable=False)
    __mapper_args__ = {'polymorphic_on': file_entry_system_name}

    def __init__(self, msg, name):
        self.file_entry_msg = msg
        self.file_entry_system_name = name

class ops_file_entry(FileEntry):
    __tablename__ = 'ops_file_entry'
    ops_file_entry_id = Column(None, ForeignKey('file_entry.file_entry_id'),
                               primary_key=True)
    __mapper_args__ = {'polymorphic_identity': 'ops_file_entry'}
In the end, what am I missing? How do I tell SQLAlchemy that anything inserted into FileEntry with a system name of 'ops' should go into the ops_file_entry table? Is my understanding way off?
Some insight into what I should do would be amazing.
You just create a new instance of ops_file_entry (shouldn't this be OpsFileEntry?), add it to the session, and upon flush, one row will be inserted into table file_entry as well as table ops_file_entry.
You don't need to set the file_entry_system_name attribute, nor do you need the trigger.
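A minimal sketch based on the models in the question (my assumption: the polymorphic identity should match the discriminator value stored in file_entry_system_name, i.e. 'ops', rather than the table name; session is an already-configured Session):

class OpsFileEntry(FileEntry):
    __tablename__ = 'ops_file_entry'
    file_entry_id = Column(None, ForeignKey('file_entry.file_entry_id'),
                           primary_key=True)
    __mapper_args__ = {'polymorphic_identity': 'ops'}

entry = OpsFileEntry(msg='some message', name='ops')
session.add(entry)
session.flush()  # one INSERT into file_entry, then one into ops_file_entry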
I don't really know Python or SQLAlchemy, but I figured I'd give it a shot for old times' sake. ;)
Have you tried basically setting up your own trigger at the application level? Something like this might work:
from sqlalchemy import event

def my_after_insert_listener(mapper, connection, target):
    # set up your constraints to store the data as you want
    if target.file_entry_system_name == 'eng':
        pass  # do your child table insert
    elif target.file_entry_system_name == 'ops':
        pass  # do your child table insert
    # ...

# associate the listener function with FileEntry,
# to execute during the "after_insert" hook
event.listen(FileEntry, 'after_insert', my_after_insert_listener)
I'm not positive, but I think target (or perhaps mapper) should contain the data being inserted.
The Events docs (especially after_insert) and Mapper docs will probably be helpful.
