I have data in the database which needs updating periodically. The source of the data returns everything that's available at that point in time, so will include new data that is not already in the database.
As I loop through the source data I don't want to be making 1000s of individual writes if possible.
Is there anything such as update_or_create but works in batches?
One thought was using update_or_create in combination with manual transactions, but I'm not sure if that just queues up the individual writes or if it would combine it all into one SQL insert?
Or, similarly, could using @commit_on_success() on a function with update_or_create inside the loop work?
I am not doing anything with the data other than translating it and saving it to a model. Nothing is dependent on that model existing during the loop.
Since Django added support for bulk_update, this is now somewhat possible, though you need to do 3 database calls (a get, a bulk create, and a bulk update) per batch. It's a bit challenging to make a good interface for a general-purpose function here, as you want the function to support efficient querying as well as the updates. Here is a method I implemented that is designed for bulk update_or_create where you have a number of common identifying keys (which could be empty) and one identifying key that varies among the batch.
This is implemented as a method on a base model, but can be used independently of that. This also assumes that the base model has an auto_now timestamp on the model named updated_on; if this is not the case, the lines of the code that assume this have been commented for easy modification.
In order to use this in batches, chunk your updates into batches before calling it. This is also a way to get around data that can have one of a small number of values for a secondary identifier without having to change the interface.
from django.db import models, transaction


class BaseModel(models.Model):
    updated_on = models.DateTimeField(auto_now=True)

    @classmethod
    def bulk_update_or_create(cls, common_keys, unique_key_name, unique_key_to_defaults):
        """
        common_keys: {field_name: field_value}
        unique_key_name: field_name
        unique_key_to_defaults: {field_value: {field_name: field_value}}

        ex. Event.bulk_update_or_create(
            {"organization": organization}, "external_id", {1234: {"started": True}}
        )
        """
        with transaction.atomic():
            filter_kwargs = dict(common_keys)
            filter_kwargs[f"{unique_key_name}__in"] = unique_key_to_defaults.keys()
            existing_objs = {
                getattr(obj, unique_key_name): obj
                for obj in cls.objects.filter(**filter_kwargs).select_for_update()
            }

            create_data = {
                k: v for k, v in unique_key_to_defaults.items() if k not in existing_objs
            }
            for unique_key_value, obj in create_data.items():
                obj[unique_key_name] = unique_key_value
                obj.update(common_keys)
            creates = [cls(**obj_data) for obj_data in create_data.values()]
            if creates:
                cls.objects.bulk_create(creates)

            # This set should contain the name of the `auto_now` field of the model
            update_fields = {"updated_on"}
            updates = []
            for key, obj in existing_objs.items():
                obj.update(unique_key_to_defaults[key], save=False)
                update_fields.update(unique_key_to_defaults[key].keys())
                updates.append(obj)
            if existing_objs:
                cls.objects.bulk_update(updates, update_fields)
        return len(creates), len(updates)

    def update(self, update_dict=None, save=True, **kwargs):
        """Helper method to update objects."""
        if not update_dict:
            update_dict = kwargs
        # This set should contain the name of the `auto_now` field of the model
        update_fields = {"updated_on"}
        for k, v in update_dict.items():
            setattr(self, k, v)
            update_fields.add(k)
        if save:
            self.save(update_fields=update_fields)
Example usage:
class Event(BaseModel):
    organization = models.ForeignKey(Organization)
    external_id = models.IntegerField(unique=True)
    started = models.BooleanField()


organization = Organization.objects.get(...)
updates_by_external_id = {
    1234: {"started": True},
    2345: {"started": True},
    3456: {"started": False},
}
Event.bulk_update_or_create(
    {"organization": organization}, "external_id", updates_by_external_id
)
Possible Race Conditions
The code above leverages a transaction and select-for-update to prevent race conditions on updates. There is, however, a possible race condition on inserts if two threads or processes are trying to create objects with the same identifiers.
The easy mitigation is to ensure that the combination of your common_keys and your unique_key is a database-enforced uniqueness constraint (which is the intended use of this function). This can be achieved either with the unique_key referencing a field with unique=True, or with the unique_key combined with a subset of the common_keys enforced as unique together by a UniqueConstraint. With database-enforced uniqueness protection, if multiple threads are trying to perform conflicting creates, all but one will fail with an IntegrityError. Due to the enclosing transaction, threads that fail will perform no changes and can be safely retried or ignored (a conflicting create that failed could just be treated as a create that happened first and then was immediately overwritten).
If leveraging uniqueness constraints is not possible, then you will either need to implement your own concurrency control or lock the entire table.
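For example, on the Event model above, that database-level guarantee could be expressed like this (a sketch; the constraint name is arbitrary and on_delete is added because recent Django versions require it):

class Event(BaseModel):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    external_id = models.IntegerField()
    started = models.BooleanField()

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=["organization", "external_id"],
                name="unique_external_id_per_organization",
            )
        ]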
As of Django 4.1, the bulk_create method supports upserts via update_conflicts, which is the single-query, batched equivalent of update_or_create:
class Foo(models.Model):
    a = models.IntegerField(unique=True)
    b = models.IntegerField()


objs = [Foo(a=1, b=1), Foo(a=1, b=2)]
Foo.objects.bulk_create(
    objs,
    update_conflicts=True,
    unique_fields=['a'],
    update_fields=['b'],
)
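bulk_create also accepts a batch_size argument, so a large list of objects can be split into several statements in one call (a sketch reusing objs from above; 1000 is an arbitrary choice):

Foo.objects.bulk_create(
    objs,
    batch_size=1000,
    update_conflicts=True,
    unique_fields=['a'],
    update_fields=['b'],
)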
Batching your updates is going to be an upsert command, and as @imposeren said, PostgreSQL 9.5 gives you that ability with ON CONFLICT. I think MySQL 5.7 does as well (see http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html), depending on your exact needs. That said, it's probably easiest to just use a db cursor. Nothing wrong with that, it's there for when the ORM just isn't enough.
Something along these lines should work. It's pseudo-ish code, so don't just cut-and-paste this, but the concept is there for you.
import itertools

from django.db import connection, transaction


class GroupByChunk(object):
    def __init__(self, size):
        self.count = 0
        self.size = size
        self.toggle = False

    def __call__(self, *args, **kwargs):
        if self.count >= self.size:  # Allows for size 0
            self.toggle = not self.toggle
            self.count = 0
        self.count += 1
        return self.toggle


def batch_update(db_results, upsert_sql):
    with transaction.atomic():
        cursor = connection.cursor()
        for _, chunk in itertools.groupby(db_results, GroupByChunk(size=1000)):
            cursor.executemany(upsert_sql, list(chunk))
Assumptions here are:
- db_results is some kind of results iterator, either a list or a dictionary
- a result from db_results can be fed directly into a raw SQL exec statement (see the sketch below for what upsert_sql might look like)
- if any of the batch updates fail, you'll be rolling back ALL of them; if you want that per chunk instead, just push the with block down into the loop
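For reference, an upsert_sql for PostgreSQL 9.5+ might look roughly like this (a sketch; the table and column names are made up):

# Hypothetical PostgreSQL 9.5+ upsert statement; adjust the table, columns and
# conflict target to your schema.
upsert_sql = """
    INSERT INTO myapp_mymodel (external_id, name, value)
    VALUES (%s, %s, %s)
    ON CONFLICT (external_id)
    DO UPDATE SET name = EXCLUDED.name, value = EXCLUDED.value
"""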
There is the django-bulk-update-or-create library for Django that can do that.
I have been using @Zags' answer and I think it's the best solution, but I'd like to point out a small issue in that code.
update_fields = {"updated_on"}
updates = []
for key, obj in existing_objs.items():
    obj.update(unique_key_to_defaults[key], save=False)
    update_fields.update(unique_key_to_defaults[key].keys())
    updates.append(obj)
if existing_objs:
    cls.objects.bulk_update(updates, update_fields)
If you are using auto_now=True fields, they are not going to be updated when you use .update() or bulk_update(), because auto_now only triggers on .save(), as you can read in the documentation.
If you have an auto_now field, e.g. updated_on, it is better to add it explicitly to the unique_key_to_defaults dict:
"unique_value" : {
"field1.." : value...,
"updated_on" : timezone.now()
}...
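In other words, building on the Event example above, you would stamp each defaults dict yourself before calling the method (a sketch):

from django.utils import timezone

now = timezone.now()
updates_by_external_id = {
    1234: {"started": True, "updated_on": now},
    2345: {"started": True, "updated_on": now},
}
Event.bulk_update_or_create(
    {"organization": organization}, "external_id", updates_by_external_id
)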
I've been using SQLAlchemy with Alembic to simplify the database access I use, and any data structure changes I make to the tables. This has been working out really well up until I started to notice more and more issues with SQLAlchemy "expiring" fields from my point of view nearly at random.
A case in point would be this snippet,
class HRDecimal(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[HRDecimal]
        """
        return list(HRDecimal.query.filter(*filters))


class Meta(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[Meta]
        """
        return list(Meta.query.filter(*filters))
Code:
ids = ['1', '2', '3']  # obviously fake list of ids

decs = HRDecimal.qfilter(
    HRDecimal.dec_id.in_(ids))
metas = Meta.qfilter(
    Meta.dec_id.in_(ids))
combined = []
for ident in ids:
    combined.append((
        ident,
        [dec for dec in decs if dec.dec_id == ident],
        [hm for hm in metas if hm.dec_id == ident]
    ))
For the above, there wasn't a problem, but when I'm processing a list that may contain a few thousand ids, this process started taking a huge amount of time, and if done from a web request in Flask, the thread would often be killed.
When I started poking around with why this was happening, the key area was
[dec for dec in decs if dec.dec_id == ident],
[hm for hm in metas if hm.dec_id == ident]
At some point during the combining of these (what I thought were) plain Python objects, calling dec.dec_id and hm.dec_id drops into the SQLAlchemy code; at best, we go into:
def __get__(self, instance, owner):
    if instance is None:
        return self

    dict_ = instance_dict(instance)
    if self._supports_population and self.key in dict_:
        return dict_[self.key]
    else:
        return self.impl.get(instance_state(instance), dict_)
of InstrumentedAttribute in sqlalchemy/orm/attributes.py, which seems to be very slow. Even worse than this, I've observed times when fields had expired, and then we enter:
def get(self, state, dict_, passive=PASSIVE_OFF):
    """Retrieve a value from the given object.

    If a callable is assembled on this object's attribute, and
    passive is False, the callable will be executed and the
    resulting value will be set as the new value for this attribute.
    """
    if self.key in dict_:
        return dict_[self.key]
    else:
        # if history present, don't load
        key = self.key
        if key not in state.committed_state or \
                state.committed_state[key] is NEVER_SET:
            if not passive & CALLABLES_OK:
                return PASSIVE_NO_RESULT

            if key in state.expired_attributes:
                value = state._load_expired(state, passive)
of AttributeImpl in the same file. The horrible issue here is that state._load_expired re-runs the SQL query completely. So in a situation like this, with a big list of idents, we end up running thousands of "small" SQL queries against the database, where I think we should only have been running the two "large" ones at the top.
Now, I've gotten around the expired issue by how I initialise the database for Flask with session_options, changing
app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(app)
to
app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(
    app,
    session_options=dict(autoflush=False, autocommit=False, expire_on_commit=False))
This has definitely improved the situation where a row's fields seemed to expire (from my observations) at random, but the "normal" slowness of accessing attributes in SQLAlchemy is still an issue for what we're currently running.
Is there any way with SQLAlchemy, to get a "real" Python object returned from a query, instead of a proxied one like it is now, so it isn't being affected by this?
Your randomness is probably related either to explicitly committing or rolling back at an inconvenient time, or to auto-commit of some kind. In its default configuration the SQLAlchemy session expires all ORM-managed state when a transaction ends. This is usually a good thing, since when a transaction ends you have no idea what the current state of the DB is. This can be disabled, as you've done with expire_on_commit=False.
The ORM is also ill-suited for extremely large bulk operations in general, as explained here. It is very well suited for handling complex object graphs and persisting them to a relational database with much less effort on your part, as it organizes the required inserts etc. for you. An important part of that is tracking changes to instance attributes. The SQLAlchemy Core is better suited for bulk operations.
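For example, a bulk insert that goes through Core instead of ORM instances might look like this (a sketch assuming a Flask-SQLAlchemy db.session and the HRDecimal model above; any other required columns are omitted):

# Sketch: executemany-style insert via the Core table, with no ORM instances
# or attribute instrumentation involved.
db.session.execute(
    HRDecimal.__table__.insert(),
    [{"dec_id": "1"}, {"dec_id": "2"}, {"dec_id": "3"}],
)
db.session.commit()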
It looks like you're performing 2 queries that produce a potentially large number of results and then doing a manual "group by" on the data, but in a rather unperformant way: for each id you scan the entire list of results, which is O(nm), where n is the number of ids and m the number of results. Instead, you should group the results into lists of objects by id first and then perform the "join". On some other database systems you could handle the grouping in SQL directly, but alas, MySQL has no notion of arrays, other than JSON.
A possibly more performant version of your grouping could be for example:
from itertools import groupby
from operator import attrgetter

ids = ['1', '2', '3']  # obviously fake list of ids

# Order the results by `dec_id` for Python itertools.groupby. Cannot
# use your `qfilter()` method as it produces lists, not queries.
decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    order_by(HRDecimal.dec_id).\
    all()

metas = Meta.query.\
    filter(Meta.dec_id.in_(ids)).\
    order_by(Meta.dec_id).\
    all()

key = attrgetter('dec_id')

decs_lookup = {dec_id: list(g) for dec_id, g in groupby(decs, key)}
metas_lookup = {dec_id: list(g) for dec_id, g in groupby(metas, key)}

combined = [(ident,
             decs_lookup.get(ident, []),
             metas_lookup.get(ident, []))
            for ident in ids]
Note that since in this version we iterate over the queries only once, all() is not strictly necessary, but it should not hurt much either. The grouping could also be done without sorting in SQL with defaultdict(list):
from collections import defaultdict

decs = HRDecimal.query.filter(HRDecimal.dec_id.in_(ids)).all()
metas = Meta.query.filter(Meta.dec_id.in_(ids)).all()

decs_lookup = defaultdict(list)
metas_lookup = defaultdict(list)

for d in decs:
    decs_lookup[d.dec_id].append(d)

for m in metas:
    metas_lookup[m.dec_id].append(m)

combined = [(ident, decs_lookup[ident], metas_lookup[ident])
            for ident in ids]
And finally to answer your question, you can fetch "real" Python objects by querying for the Core table instead of the ORM entity:
decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    with_entities(HRDecimal.__table__).\
    all()
which will result in a list of namedtuple-like objects that can easily be converted to dicts with _asdict().
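For example (a sketch continuing the query above):

rows = [row._asdict() for row in decs]  # plain dicts, no instrumentation or expiry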
I have two Django models as shown below, MyModel1 & MyModel2:
class MyModel1(CachingMixin, MPTTModel):
    name = models.CharField(null=False, blank=False, max_length=255)
    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])


class MyModel2(CachingMixin, models.Model):
    name = models.CharField(null=False, blank=False, max_length=255)
    model1 = models.ManyToManyField(MyModel1, related_name="MyModel2_MyModel1")
    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])
MyModel2 has a ManyToMany field to MyModel1 called model1.
Now look what happens when I add a new entry to this ManyToMany field. According to Django, it has no effect:
>>> m1 = MyModel1.objects.all()[0]
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
Why? It definitely seems like a caching issue, because I can see that there is a new entry in the database table myapp_mymodel2_mymodel1 for this link between m2 and m1. How should I fix it?
Is django-cache-machine really needed?
MyModel1.objects.all()[0]
Roughly translates to
SELECT * FROM app_mymodel LIMIT 1
Queries like this are always fast. There would not be a significant difference in speeds whether you fetch it from the cache or from the database.
When you use cache manager you actually add a bit of overhead here that might make things a bit slower. Most of the time this effort will be wasted because there may not be a cache hit as explained in the next section.
How django-cache-machine works
Whenever you run a query, CachingQuerySet will try to find that query
in the cache. Queries are keyed by {prefix}:{sql}. If it’s there, we
return the cached result set and everyone is happy. If the query isn’t
in the cache, the normal codepath to run a database query is executed.
As the objects in the result set are iterated over, they are added to
a list that will get cached once iteration is done.
source: https://cache-machine.readthedocs.io/en/latest/
Accordingly, with the two queries executed in your question being identical, the cache manager will fetch the second result set from memcached, provided the cache hasn't been invalidated.
The same link explains how cache keys are invalidated.
To support easy cache invalidation, we use “flush lists” to mark the
cached queries an object belongs to. That way, all queries where an
object was found will be invalidated when that object changes. Flush
lists map an object key to a list of query keys.
When an object is saved or deleted, all query keys in its flush list
will be deleted. In addition, the flush lists of its foreign key
relations will be cleared. To avoid stale foreign key relations, any
cached objects will be flushed when the object their foreign key
points to is invalidated.
It's clear that saving or deleting an object results in many objects in the cache having to be invalidated, so you are slowing down these operations by using the cache manager. Also worth noting is that the invalidation documentation does not mention many-to-many fields at all. There is an open issue for this, and from your comment on that issue it's clear that you have discovered it too.
Solution
Chuck cache machine. Caching all queries is almost never worth it. It leads to all kinds of hard-to-find bugs and issues. The best approach is to optimize your tables and fine-tune your queries. If you find a particular query that is too slow, cache it manually.
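Manually caching a known-hot query is straightforward with Django's low-level cache API (a sketch; the helper name, key format and 5-minute timeout are all made up):

from django.core.cache import cache

def get_model1_ids_for(m2):
    # Hypothetical helper: cache just this one expensive lookup for 5 minutes.
    key = f"mymodel2:{m2.pk}:model1_ids"
    ids = cache.get(key)
    if ids is None:
        ids = list(m2.model1.values_list("pk", flat=True))
        cache.set(key, ids, timeout=300)
    return ids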
This was my workaround solution:
>>> m1 = MyModel1.objects.all()[0]
>>> m1
<MyModel1: ID: 8887972990743179; name: my-name-blahblah>
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
>>> MyModel1.objects.invalidate(m1)
>>> MyModel2.objects.invalidate(m2)
>>> m2.save()
>>> m2.model1.all()
[<MyModel1: ID: 8887972990743179; name: my-name-blahblah>]
Have you considered hooking into the model signals to invalidate the cache when an object is added? For your case you should look at M2M Changed Signal
Here is a small example that doesn't solve your problem, but relates the workaround you gave to my signals-based approach (I don't know django-cache-machine):
from django.db.models.signals import m2m_changed


def invalidate_m2m(sender, **kwargs):
    instance = kwargs.get('instance', None)
    action = kwargs.get('action', None)
    if action == 'post_add':
        # `Sender` stands in for the manager of the model whose cache
        # should be invalidated (see the next answer for a generic version).
        Sender.objects.invalidate(instance)


m2m_changed.connect(invalidate_m2m, sender=MyModel2.model1.through)
A. J. Parr's answer is almost correct, but you forgot post_remove, and you can also bind it to every ManyToManyField like this:
from django.db.models.signals import m2m_changed
from django.dispatch import receiver


@receiver(m2m_changed)
def invalidate_cache_m2m(sender, instance, action, reverse, model, pk_set, **kwargs):
    if action in ['post_add', 'post_remove']:
        model.objects.invalidate(instance)
Suppose I have three django models:
class Section(models.Model):
    name = models.CharField()


class Size(models.Model):
    section = models.ForeignKey(Section)
    size = models.IntegerField()


class Obj(models.Model):
    name = models.CharField()
    sizes = models.ManyToManyField(Size)
I would like to import a large amount of Obj data where many of the sizes fields will be identical. However, since Obj has a ManyToMany field, I can't just test for existence like I normally would. I would like to be able to do something like this:
x = Obj(name='foo')
x.sizes.add(sizemodel1)  # these can be looked up with get_or_create
...
x.sizes.add(sizemodelN)  # these can be looked up with get_or_create

# Now test whether x already exists, so I don't add a duplicate
try:
    Obj.objects.get(x)
except Obj.DoesNotExist:
    x.save()
However, I'm not aware of a way to get an object this way; you have to just pass in keyword parameters, which don't work for ManyToManyFields.
Is there any good way I can do this? The only idea I've had is to build up a set of Q objects to pass to get:
myq = myq & Q(sizes__id=sizemodelN.id)
But I am not sure this will even work...
Use a through model and then .get() against that.
http://docs.djangoproject.com/en/dev/topics/db/models/#extra-fields-on-many-to-many-relationships
Once you have a through model, you can .get() or .filter() or .exists() to determine the existence of an object that you might otherwise want to create. Note that .get() is really intended for columns where unique is enforced by the DB - you might have better performance with .exists() for your purposes.
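A sketch of what that could look like (the through model name is made up; some_obj and some_size are placeholders):

# Sketch: an explicit through model; Obj.sizes would become
# models.ManyToManyField(Size, through='ObjSize').
class ObjSize(models.Model):
    obj = models.ForeignKey('Obj', on_delete=models.CASCADE)
    size = models.ForeignKey('Size', on_delete=models.CASCADE)

# Existence checks can then hit the join table directly:
ObjSize.objects.filter(obj=some_obj, size=some_size).exists()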
If this is too radical or inconvenient a solution, you can also just grab the ManyRelatedManager and iterate through to determine if the object exists:
object_sizes = obj.sizes.all()
exists = object_sizes.filter(id__in=some_bunch_of_size_object_ids_you_are_curious_about).exists()
if not exists:
    (your creation code here)
Your example doesn't make much sense because you can't add m2m relationships before x is saved, but it illustrates what you are trying to do pretty well. You have a list of Size objects created via get_or_create(), and want to create an Obj only if no Obj with that same set of sizes already exists?
Unfortunately, this is not possible very easily. Chaining Q objects like Q(sizes__id=size1.id) & Q(sizes__id=size2.id) doesn't work for m2m the way you'd hope.
You could certainly use Obj.objects.filter(sizes__in=sizes), but that would match any Obj that shares just one size with a huge list of sizes.
Check out this post for an __in exact question, answered by Malcolm, so I trust it quite a bit.
I wrote some Python for fun that could take care of this. This is a one-time import, right?
from django.db.models import QuerySet


def has_exact_m2m_match(match_list):
    """
    Get exact Obj m2m match
    """
    if isinstance(match_list, QuerySet):
        match_list = [x.id for x in match_list]

    results = {}
    match = set(match_list)
    # note: we are accessing the auto-generated through model for the sizes m2m
    for obj, size in \
            Obj.sizes.through.objects.filter(size__in=match).values_list('obj', 'size'):
        try:
            results[obj].append(size)
        except KeyError:
            results[obj] = [size]

    # True if some Obj is related to every size ID you provided
    return any(set(sizes) == match for sizes in results.values())


# If there is a match, an Obj already exists with all of the sizes you
# provided to the function, so only create a new one when there is no match.
sizes = [size1, size2, size3, sizeN...]

if not has_exact_m2m_match(sizes):
    x = Obj.objects.create(name='foo')  # saves, so you can use x.sizes.add
    x.sizes.add(*sizes)
I'm trying to determine whether or not a simple caching trick will actually be useful. I know Django querysets are lazy to improve efficiency, but I'm wondering if they save the result of their query after the data has been called.
For instance, if I have two models:
class Klass1(models.Model):
    k2 = models.ForeignKey('Klass2')


class Klass2(models.Model):
    # Model Code ...

    @property
    def klasses(self):
        self.klasses = Klass1.objects.filter(k2=self)
        return self.klasses
And I call klass_2_instance.klasses[:] somewhere, so the database is accessed and returns a query. I'm wondering, if I call klass_2_instance.klasses again, will the database be accessed a second time, or will the Django queryset save the result from the first call?
Django will not cache it for you.
Instead of Klass1.objects.filter(k2=self), you could just do self.klass1_set.all().
Because Django always creates a set on the many side of 1-n relations.
I guess this kind of cache is complicated because it would have to remember all the filters, excludes and order_bys used. Although it could be done using a well-designed hash, you should at least have a parameter to disable the cache.
If you would like any cache, you could do:
class Klass2(models.Model):

    def __init__(self, *args, **kwargs):
        self._klass1_cache = None
        super(Klass2, self).__init__(*args, **kwargs)

    def klasses(self):
        if self._klass1_cache is None:
            # Here you can't remove list(..) because it is forcing query execution exactly once.
            self._klass1_cache = list(self.klass1_set.all())
        return self._klass1_cache
This is very useful when you loop over all the related objects more than once. For me it often happens in templates, when I need to loop more than one time.
This query isn't cached by Django.
The forward FK relationship - i.e. given a Klass1 object klass, doing klass.k2 - is cached after the first lookup. But the reverse, which you're doing here - and which is actually usually spelled klass2.klass1_set.all() - is not cached.
You can easily memoize it:
@property
def klasses(self):
    if not hasattr(self, '_klasses'):
        self._klasses = self.klass1_set.all()
    return self._klasses
(Note that your existing code won't work, as you're overriding the method klasses with an attribute.)
Try using johnny-cache if you want transparent caching of querysets.
I have a model with a unique integer that needs to increment with regards to a foreign key, and the following code is how I currently handle it:
class MyModel(models.Model):
    business = models.ForeignKey(Business)
    number = models.PositiveIntegerField()
    spam = models.CharField(max_length=255)

    class Meta:
        unique_together = (('number', 'business'),)

    def save(self, *args, **kwargs):
        if self.pk is None:  # New instances only
            try:
                highest_number = MyModel.objects.filter(
                    business=self.business).order_by('-number').all()[0].number
                self.number = highest_number + 1
            except ObjectDoesNotExist:  # First MyModel instance
                self.number = 1
        super(MyModel, self).save(*args, **kwargs)
I have the following questions regarding this:
Multiple people can create MyModel instances for the same business, all over the internet. Is it possible for 2 people creating MyModel instances at the same time to both see the same highest number (say 500) and then both try to set self.number = 501 at the same time (raising an IntegrityError)? The answer seems like an obvious "yes, it could happen", but I had to ask.
Is there a shortcut, or "Best way" to do this, which I can use (or perhaps a SuperAutoField that handles this)?
I can't just slap a while model_not_saved: with a try: ... except IntegrityError: inside, because other constraints in the model could lead to an endless loop, and a disaster worse than Chernobyl (maybe not quite that bad).
You want that constraint at the database level. Otherwise you're going to eventually run into the concurrency problem you discussed. The solution is to wrap the entire operation (read, increment, write) in a transaction.
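A sketch of that approach for the save() method above (one way among several; it assumes a backend with row locks such as PostgreSQL, and it also takes a select_for_update lock on the parent Business row, which the paragraph above doesn't spell out, because a transaction alone does not serialize the read):

from django.db import transaction

def save(self, *args, **kwargs):
    if self.pk is None:  # new instances only
        with transaction.atomic():
            # Lock the parent Business row so concurrent creates for the
            # same business queue up here instead of racing for a number.
            Business.objects.select_for_update().get(pk=self.business_id)
            last = (
                MyModel.objects.filter(business=self.business)
                .order_by('-number')
                .values_list('number', flat=True)
                .first()
            )
            self.number = (last or 0) + 1
            super(MyModel, self).save(*args, **kwargs)
    else:
        super(MyModel, self).save(*args, **kwargs)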
Why can't you use an AutoField instead of a PositiveIntegerField?
number = models.AutoField()
However, in this case number is almost certainly going to equal yourmodel.id, so why not just use that?
Edit:
Oh, I see what you want. You want a number field that only increments within the set of MyModel instances that share the same business.
I would still recommend just using the id field if you can, since it's certain to be unique. If you absolutely don't want to do that (maybe you're showing this number to users), then you will need to wrap your save method in a transaction.
You can read more about transactions in the docs:
http://docs.djangoproject.com/en/dev/topics/db/transactions/
If you're just using this to count how many instances of MyModel have a FK to Business, you should do that as a query rather than trying to store a count.
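For example (a sketch; some_business is a placeholder):

from django.db.models import Count

# Count on demand instead of storing a per-business sequence number:
MyModel.objects.filter(business=some_business).count()

# Or annotate every Business with its count in a single query:
Business.objects.annotate(num_mymodels=Count('mymodel'))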