How to manage concurrency in queries? - python

I have this code:
if LikedSlot.objects.filter(restaurant__id=r.id, user__id=u.id).count() == 0:
    l = LikedSlot.objects.create(restaurant=r, user=u)
So the idea is to create a new LikedSlot only if the user hasn't liked the restaurant before, but I have a race condition: two requests can both get True on the first line if they reach it at the same time.
I tried the following but it doesn't seem to fix the issue either:
from django.db import transaction

with transaction.atomic():
    if LikedSlot.objects.filter(restaurant__id=r.id, user__id=u.id).count() == 0:
        l = LikedSlot.objects.create(restaurant=r, user=u)
Do you have an idea how to fix this?

I suggest you let the database enforce uniqueness in such cases. Change your model so that the restaurant+user pair is unique:
class LikedSlot(models.Model):
    ...

    class Meta:
        unique_together = ('restaurant', 'user')
This way the database will prevent duplicate records from being created.
After making this change, you can also use the built-in get_or_create method instead of checking for duplicates yourself:
liked_slot, created = LikedSlot.objects.get_or_create(restaurant=r, user=u)
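To see concretely how a unique constraint closes the race, here is a minimal sketch using the stdlib sqlite3 module; the table and column names are illustrative, not what Django would actually generate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE liked_slot ("
    "  restaurant_id INTEGER, user_id INTEGER,"
    "  UNIQUE (restaurant_id, user_id)"  # what unique_together becomes in SQL
    ")"
)

def like(restaurant_id, user_id):
    """Insert a like; return True if created, False if it already existed."""
    try:
        conn.execute(
            "INSERT INTO liked_slot (restaurant_id, user_id) VALUES (?, ?)",
            (restaurant_id, user_id),
        )
        return True
    except sqlite3.IntegrityError:
        # The database rejects the duplicate, no matter how the
        # requests are interleaved.
        return False

print(like(1, 1))  # → True
print(like(1, 1))  # → False
```

Whichever request inserts second gets the IntegrityError, which is exactly what get_or_create handles for you under the constraint.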

Related

django queryset filter with back reference

I'm a C++ developer and a Python novice, just following the Django tutorials.
I want to know how to filter a queryset by its back references.
Below are my models.
# models.py
import datetime

from django.db import models
from django.utils import timezone


class Question(models.Model):
    pub_date = models.DateTimeField('date published')


class Choice(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
Now, I want to get a queryset of Questions whose pub_date is in the past AND which are referenced by at least one Choice. The second condition is what causes my problem.
Below is what I tried.
# First method
question_queryset = Question.objects.filter(pub_date__lte=timezone.now())
for q in question_queryset.iterator():
    if Choice.objects.filter(question=q.pk).count() == 0:
        print(q)
        # It works: only Questions that are not referenced
        # by any Choice are printed.
        # But how can I exclude q from question_queryset?
# Second method
question_queryset = Question.objects.filter(
    pub_date__lte=timezone.now()
    & Choice.objects.filter(question=pk).count() > 0)  # Doesn't work.
# NameError: name 'pk' is not defined
# How can I use pk as an rvalue in the Question.objects.filter context?
Is it because I'm not familiar with Python syntax? Or is my approach to the data itself wrong? Do you have any good ideas for solving my problem without changing the models?
=======================================
Edit: I just found a way for the first method.
# First method
question_queryset = Question.objects.filter(pub_date__lte=timezone.now())
for q in question_queryset.iterator():
    if Choice.objects.filter(question=q.pk).count() == 0:
        question_queryset = question_queryset.exclude(pk=q.pk)
A new concern arises: if the number of Question rows is n and the number of Choice rows is m, the method above takes O(n * m) time, right? Is there any way to improve performance? Is my way of handling the data the problem, or is the structure of the data itself the problem?
Here is the documentation on how to follow relationships backwards. The following query yields the same result (note that Count needs to be imported):
from django.db.models import Count

queryset = (Question.objects
            .filter(pub_date__lte=timezone.now())
            .annotate(num_choices=Count('choice'))
            .filter(num_choices__gt=0))
It is probably better to rely on the Django ORM than to write your own filter: the annotation runs as a single SQL query instead of one extra query per Question.
Related to the design: this kind of relationship can lead to duplicates in your database, since different questions sometimes have the same answer. I would probably go with a many-to-many relationship instead.
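As an aside on the O(n * m) worry from the question: even in plain Python, collecting the referenced Question ids into a set first makes the check O(n + m). A sketch with illustrative ids:

```python
# Illustrative stand-ins for Question.pk values and Choice.question_id values
question_ids = [1, 2, 3, 4]
choice_question_ids = [2, 2, 4]  # each Choice references a Question

# One pass over the Choices builds the lookup set: O(m)
referenced = set(choice_question_ids)

# One pass over the Questions with O(1) membership tests: O(n)
with_choices = [q for q in question_ids if q in referenced]

print(with_choices)
# → [2, 4]
```

The annotate(Count(...)) query lets the database do this same grouping work for you in one round trip.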
That's not how querysets are supposed to work. Iterating a queryset iterates over every row the database returns for it, so you don't need iterator() here:
question_queryset = Question.objects.filter(pub_date__lte=timezone.now())
for q in question_queryset:
    if Choice.objects.filter(question=q.pk).count() == 0:
        print(q)
I didn't test it, but this should work.

Can I use regular expressions in Django F() expressions?

I have a model:
class MyModel(models.Model):
    long_name = models.CharField(unique=True, max_length=256)
    important_A = models.CharField(unique=True, max_length=256)
    important_B = models.CharField(unique=True, max_length=256)
MyModel.long_name contains information that I need to put into dedicated fields (important_A and important_B). An example of a string in long_name would be S1_arctic_mosaic_tile_029r_016d_300_20221108T062432.png
I basically need to match one part of the string in long_name, i.e. everything between the 4th and the 5th underscore ("029r"), and put it into important_A, and everything between the 5th and the 6th ("016d") into important_B.
Since the database (PostgreSQL on Django 3.2.15) is quite big (~2,000,000 rows), looping (and using things like Python's str.split()) is not an option, as it would take too long.
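For reference, the extraction itself can be sketched in plain Python with the re module; the pattern below is my own assumption based on the example filename, not the question's exact regex:

```python
import re

# Hypothetical pattern: groups 1 and 2 capture the tokens between the
# 4th/5th and 5th/6th underscores of names like the example filename.
PATTERN = re.compile(r"S1_.*_(\d{2,3}[rl])_(\d{2,3}[ud])_.*\.png")

def extract_parts(long_name):
    """Return (important_A, important_B), or (None, None) if no match."""
    m = PATTERN.match(long_name)
    if m is None:
        return None, None
    return m.group(1), m.group(2)

print(extract_parts("S1_arctic_mosaic_tile_029r_016d_300_20221108T062432.png"))
# → ('029r', '016d')
```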
I'm thus looking for a way to use regex in the migration to populate important_A and important_B from long_field. My current migration looks like this:
from django.db import migrations, models
from django.db.models import F


def populate_fields(apps, schema_editor):
    MyModel = apps.get_model("myapp", "mymodel")
    MyModel.objects.all().update(
        important_A=F('long_name=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"')
    )


class Migration(migrations.Migration):
    dependencies = [
        ('icedata', '0036_something'),
    ]
    operations = [
        migrations.RunPython(populate_fields),
    ]
When I try to run this migration, I get the following error:
django.core.exceptions.FieldError: Cannot resolve keyword 'filename=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"' into field. Choices are: long_name, id
When I instead use F('long_name__regex=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"'), I instead get:
Cannot resolve keyword 'regex=r"S1_.*_(\\d{2,3}(r|l)_\\d{2,3}(u|d))_.*\\.png"' into field. Join on 'long_name' not permitted.
How can I use regular expressions together with F()-expressions?
Or, if I can't, is there another way to use the database to extract part of a string and put it in another field?
I'm surprised that updating 2M rows is "too slow", but you would definitely want to avoid creating two million objects at once, or doing 2M DB queries to update a single object. You might:
Edit the model to create important_A and important_B fields with default values which cannot ever be valid in production. (Blank or null, usually). Makemigrations and migrate.
Run code to update the database a sane number of objects at a time.
Something like:
DEFAULT = ''
BATCH_SIZE = 1000

while True:
    objects = list(MyModel.objects.filter(important_A=DEFAULT)[:BATCH_SIZE])
    if len(objects) == 0:
        break  # all done
    for o in objects:
        # stuff to get new_A and new_B non-DEFAULT values
        o.important_A = new_A
        o.important_B = new_B
        assert new_A != DEFAULT, 'avoid infinite loop bug'
    n_updated = MyModel.objects.bulk_update(
        objects, ['important_A', 'important_B']
    )
    assert n_updated == len(objects), 'WTF?'  # return values should be checked
Implement methods on MyModel to make sure that the important_A and important_B fields can never get out of sync with long_name (if the relationship is permanent rather than as-of-now, that is). This could be the save method, or properties with getters and setters to cross-reference the fields.
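The batch loop above can be sketched end to end against the stdlib sqlite3 module; the schema, batch size, and the "derive the new value" step are all illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mymodel (id INTEGER PRIMARY KEY, long_name TEXT, important_a TEXT)"
)
conn.executemany(
    "INSERT INTO mymodel (long_name, important_a) VALUES (?, '')",
    [(f"name_{i:03d}r",) for i in range(10)],
)

BATCH_SIZE = 4
DEFAULT = ''
while True:
    # Fetch the next batch of rows still holding the sentinel default
    rows = conn.execute(
        "SELECT id, long_name FROM mymodel WHERE important_a = ? LIMIT ?",
        (DEFAULT, BATCH_SIZE),
    ).fetchall()
    if not rows:
        break  # all done
    # Derive the new value (here: the last 4 chars of long_name, an
    # illustrative stand-in for the regex extraction) and bulk-update
    conn.executemany(
        "UPDATE mymodel SET important_a = ? WHERE id = ?",
        [(name[-4:], pk) for pk, name in rows],
    )

remaining = conn.execute(
    "SELECT COUNT(*) FROM mymodel WHERE important_a = ''"
).fetchone()[0]
print(remaining)
# → 0
```

Because each derived value is non-empty, every batch shrinks the remaining set and the loop terminates, which is what the `assert new_A != DEFAULT` guard protects in the Django version.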

How to return field as set() using peewee

I have been working a bit with an ORM, Peewee, and I am trying to understand how to get the url field from the table. The condition is that the visible column needs to be true as well: if visible is True and store_id is 4, return all the urls as a set.
I have currently done something like this:
from peewee import (
    Model,
    TextField,
    BooleanField,
    IntegrityError,
)
from playhouse.pool import PooledPostgresqlDatabase

# -------------------------------------------------------------------------
# Connection to Postgresql
# -------------------------------------------------------------------------
postgres_pool = PooledPostgresqlDatabase(
    'xxxxxxx',
    host='xxxxxxxx',
    user='xxxxxxxx',
    password='xxxxxx',
    max_connections=20,
    stale_timeout=30,
)


# ------------------------------------------------------------------------------- #
class Products(Model):
    store_id = TextField(column_name='store_id')
    url = TextField(column_name='url')
    visible = BooleanField(column_name='visible')

    class Meta:
        database = postgres_pool
        db_table = "develop"

    @classmethod
    def get_urls(cls):
        try:
            return set(i.url for i in cls.select().where((cls.store_id == 4) & cls.visible))
        except IntegrityError:
            return None
However, using this method takes around 0.13 s, which feels a bit too long for what it is supposed to do. I believe this is due to the for loop and building the set(), and I wonder whether peewee can do something like cls.select(cls.url).where((cls.store_id == 4) & cls.visible) and return the result as a set()?
How many products do you have? How big is this set? Why not use distinct() so that the database de-duplicates them for you? What indexes do you have? All of these questions are much more pertinent than "how do I make this python loop faster".
I'd suggest that you need an index on (store_id, visible), or a partial index on store_id where visible:
create index "product_urls" on "products" ("store_id") where "visible"
You could even use a covering index but this may take up a lot of disk space:
create index "product_urls" on "products" ("store_id", "url") where visible
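The effect of the partial index and of letting the database de-duplicate can be sketched with the stdlib sqlite3 module (SQLite also supports partial indexes; the data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (store_id TEXT, url TEXT, visible INTEGER)")
# Partial covering index: only rows with visible = 1 are indexed,
# mirroring the suggested Postgres index (names taken from the answer).
conn.execute(
    "CREATE INDEX product_urls ON products (store_id, url) WHERE visible = 1"
)
rows = [
    ("4", "https://a.example", 1),
    ("4", "https://a.example", 1),  # duplicate, removed by DISTINCT
    ("4", "https://b.example", 1),
    ("4", "https://c.example", 0),  # not visible
    ("5", "https://d.example", 1),  # different store
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)

# SELECT DISTINCT makes the database de-dupe before anything reaches Python
urls = {
    url for (url,) in conn.execute(
        "SELECT DISTINCT url FROM products WHERE store_id = ? AND visible = 1",
        ("4",),
    )
}
print(urls)
# → {'https://a.example', 'https://b.example'}
```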
Once you've got the actual query sped up with an index, you can also use distinct() to make the db de-dupe the URLs before sending them to Python. Additionally, since you only need the URL, select just that column and use the tuples() method to avoid constructing model instances:
@classmethod
def get_urls(cls):
    query = cls.select(cls.url).where((cls.store_id == 4) & cls.visible)
    return set(url for (url,) in query.distinct().tuples())
Lastly please read the docs: http://docs.peewee-orm.com/en/latest/peewee/querying.html#iterating-over-large-result-sets

Filter queryset to return only the best result for each user

I have a couple of django models, one of which holds a number of user results for different events. I'm looking for a way to generate a queryset consisting of only the best (highest) result for each user that also has the other attributes attached (like the date of the result).
My models are shown below; I'm also using the built-in User model:
class CombineEvents(models.Model):
    team = models.ForeignKey(Team)
    event = models.CharField(max_length=100)
    metric = models.CharField(max_length=100)
    lead_order = models.IntegerField()

    def __unicode__(self):
        return self.event


class CombineResults(models.Model):
    user = models.ForeignKey(User)
    date = models.DateField()
    event = models.ForeignKey(CombineEvents)
    result = models.FloatField()

    def __unicode__(self):
        return str(self.date) + " " + str(self.event)
I am iterating through each event and attaching a queryset of the event's results, which works fine, but I want that sub-queryset to include only one object per user, and that object should be the user's best result. My queryset code is below:
combine_events = CombineEvents.objects.filter(team__id=team_id)
for event in combine_events:
    event.results = CombineResults.objects.filter(event=event)
I'm not sure how to filter down to just the best results for each user. I want to use these querysets to create leaderboards, so I'd still like to have the date of the best result and the user name, but I don't want the leaderboard to allow more than one spot per user. Any ideas?
Since your CombineResults model has a FK relation to CombineEvents, you can do something like this:
combine_events = CombineEvents.objects.filter(team__id=team_id)
for event in combine_events:
    result = event.combineresults_set.order_by('-result')[0]
The combineresults_set attribute is auto-generated by the FK field, though you can set it to something more helpful by specifying the related_name keyword argument:
class CombineResults(models.Model):
    event = models.ForeignKey(CombineEvents, related_name='results')
would enable you to call event.results.order_by(...). There is more in the documentation here:
https://docs.djangoproject.com/en/1.9/topics/db/queries/#following-relationships-backward
Note that this isn't the most DB-friendly approach: you will hit the database once to get combine_events (as soon as you start iterating), and then once more for each event in that list. It would be better to use prefetch_related(), which reduces this to two DB queries. Documentation can be found here.
prefetch_related(), however, defaults to doing a queryset.all() for the related objects; you can control this further by using Prefetch objects, as documented here.
Edit:
Apologies for getting the question wrong. Getting every user's best result per event (which I think is what you want) is not quite as simple. I'd probably do something like this:
from django.db.models import Q, Max

combine_events = CombineEvents.objects \
    .filter(team_id=team_id) \
    .prefetch_related('combineresults_set')

for event in combine_events:
    # Get the value of the best result per user
    result = event.combineresults_set.values('user').annotate(best=Max('result'))
    # Now construct a Q() object; note this will evaluate the result query
    base_q = Q()
    for res in result:
        # this is equivalent to base_q = base_q | ...
        base_q |= (Q(user_id=res['user']) & Q(result=res['best']))
    # Now you're ready to filter results
    result = event.combineresults_set.filter(base_q)
You can read more about Q objects here, or alternatively write your own SQL using RawSQL and the likes. Or wait for someone with a better idea..
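The per-user reduction that the values('user').annotate(best=Max('result')) query performs can be sketched in plain Python (the data is illustrative):

```python
# (user_id, date, result) rows, as CombineResults might return them
rows = [
    (1, "2016-01-10", 9.5),
    (1, "2016-02-01", 11.0),
    (2, "2016-01-15", 8.0),
    (2, "2016-03-02", 7.5),
]

best = {}  # user_id -> (date, result) of the highest result seen so far
for user, date, result in rows:
    if user not in best or result > best[user][1]:
        best[user] = (date, result)

print(best)
# → {1: ('2016-02-01', 11.0), 2: ('2016-01-15', 8.0)}
```

Keeping the date alongside the maximum is exactly why the answer needs the second filter(base_q) query: the annotate alone only yields (user, best) pairs.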

get_or_create django model with ManyToMany field

Suppose I have three Django models:
class Section(models.Model):
    name = models.CharField()


class Size(models.Model):
    section = models.ForeignKey(Section)
    size = models.IntegerField()


class Obj(models.Model):
    name = models.CharField()
    sizes = models.ManyToManyField(Size)
I would like to import a large amount of Obj data, where many of the sizes fields will be identical. However, since Obj has a ManyToMany field, I can't just test for existence like I normally would. I would like to be able to do something like this:
x = Obj(name='foo')
x.sizes.add(sizemodel1)  # these can be looked up with get_or_create
...
x.sizes.add(sizemodelN)  # these can be looked up with get_or_create

# Now test whether x already exists, so I don't add a duplicate
try:
    Obj.objects.get(x)
except Obj.DoesNotExist:
    x.save()
However, I'm not aware of a way to get an object this way; you have to pass in keyword arguments, which don't work for ManyToManyFields.
Is there any good way I can do this? The only idea I've had is to build up a set of Q objects to pass to get():
myq = myq & Q(sizes__id=sizemodelN.id)
But I am not sure this will even work...
Use a through model and then .get() against that.
http://docs.djangoproject.com/en/dev/topics/db/models/#extra-fields-on-many-to-many-relationships
Once you have a through model, you can .get() or .filter() or .exists() to determine the existence of an object that you might otherwise want to create. Note that .get() is really intended for columns where unique is enforced by the DB - you might have better performance with .exists() for your purposes.
If this is too radical or inconvenient a solution, you can also just grab the related manager and check directly:
object_sizes = obj.sizes.all()
exists = object_sizes.filter(id__in=some_bunch_of_size_object_ids_you_are_curious_about).exists()
if not exists:
    ...  # your creation code here
Your example doesn't quite make sense, because you can't add m2m relationships before x is saved, but it illustrates what you are trying to do well. You have a list of Size objects created via get_or_create() and want to create an Obj only if no Obj with exactly those sizes already exists?
Unfortunately, this is not easy. Chaining Q objects (Q(sizes=s1) & Q(sizes=s2) & ...) doesn't work for m2m lookups.
You could certainly use Obj.objects.filter(sizes__in=sizes), but that matches any Obj that has even one size from a huge list of sizes.
Check out this post for an __in exact question, answered by Malcolm, so I trust it quite a bit.
I wrote some Python for fun that could take care of this. This is a one-time import, right?
from django.db.models.query import QuerySet


def has_exact_m2m_match(match_list):
    """Return True if some Obj has exactly the given set of Size ids."""
    if isinstance(match_list, QuerySet):
        match_list = [x.id for x in match_list]
    match = set(match_list)
    results = {}
    # Note: we are accessing the auto-generated through model for the sizes m2m
    for obj, size in \
            Obj.sizes.through.objects.filter(size__in=match).values_list('obj', 'size'):
        results.setdefault(obj, []).append(size)
    # An Obj is an exact match only if the filtered rows cover every id in
    # `match` AND the Obj has no extra sizes beyond those (the count check
    # guards against supersets, which the size__in filter would hide).
    for obj, sizes in results.items():
        if set(sizes) == match and \
                Obj.sizes.through.objects.filter(obj=obj).count() == len(match):
            return True
    return False


sizes = [size1, size2, size3, sizeN...]
if not has_exact_m2m_match(sizes):
    x = Obj.objects.create(name='foo')  # saves, so you can use x.sizes.add
    x.sizes.add(*sizes)
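The exact-set comparison at the heart of the function can be sketched with plain data (all ids are illustrative). Note that the grouping must see each Obj's full size set: an Obj with a superset like {10, 11, 12} must not count as a match for {10, 11}:

```python
# (obj_id, size_id) rows as the m2m through table would return them
through_rows = [(1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (3, 12)]
match = {10, 11}

# Group size ids by object
groups = {}
for obj_id, size_id in through_rows:
    groups.setdefault(obj_id, set()).add(size_id)

# Only obj 1 has exactly the target set; obj 2 is missing 11,
# and obj 3 has the extra size 12.
exact = sorted(obj_id for obj_id, sizes in groups.items() if sizes == match)
print(exact)
# → [1]
```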
