I have a model:
class MyModel(models.Model):
long_name = models.CharField(unique=True, max_length=256)
important_A = models.CharField(unique=True, max_length=256)
important_B = models.CharField(unique=True, max_length=256)
MyModel.long_name contains information, that I need to put in dedicated fields (important_A and important_B). An example of a string in long_name would be S1_arctic_mosaic_tile_029r_016d_300_20221108T062432.png
I basically need to match one part of the string in long_name, i.e. everything between the 4. and the 5. underscore ("029r") and put it into important_A, and everything between the 5. and the 6. ("016d") into important_B.
Since the database (PostgreSQL on Django 3.2.15) is quite big (~2.000.000 rows), looping (and using things like Python's str.split()) is not an option, as this would take too long.
I'm thus looking for a way to use regex in the migration to populate important_A and important_B from long_field. My current migration looks like this:
from django.db import migrations, models
from django.db.models import F
def populate_fields(apps, schema_editor):
MyModel = apps.get_model("myapp", "mymodel")
MyModel.objects.all().update(important_A=
F('long_name=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"')
)
class Migration(migrations.Migration):
dependencies = [
('icedata', '0036_something'),
]
operations = [
migrations.RunPython(populate_fields),
]
When I try to run this migration, I get the following error:
django.core.exceptions.FieldError: Cannot resolve keyword 'filename=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"' into field. Choices are: long_name, id
When I instead use F('long_name__regex=r"S1_.*_(\d{2,3}(r|l)_\d{2,3}(u|d))_.*\.png"'), I instead get:
Cannot resolve keyword 'regex=r"S1_.*_(\\d{2,3}(r|l)_\\d{2,3}(u|d))_.*\\.png"' into field. Join on 'long_name' not permitted.
How can I use regular expressions together with F()-expressions?
Or, if I can't, is there another way to use the database to extract part of a string and put it in another field?
I'm surprised that updating 2M rows is "too slow", but you would definitely want to avoid creating two million objects at once, or doing 2M DB queries to update a single object. You might:
Edit the model to create important_A and important_B fields with default values which cannot ever be valid in production. (Blank or null, usually). Makemigrations and migrate.
Run code to update the database a sane number of objects at a time.
Something like:
DEFAULT = ''
BATCH_SIZE = 1000
while True:
objects = list( MyModel.objects.filter( important_A=DEFAULT)[:BATCH_SIZE] )
if len(objects) == 0:
break # all done
for o in objects:
# stuff to get new_A and new_B non-DEFAULT values
o.important_A = new_A
o.important_B = new_B
assert new_A != DEFAULT, 'avoid infinite loop bug'
n_updated = MyModel.objects.bulk_update(
objects, ['important_A','important_B']
)
assert n_updated == len(objects), 'WTF?' # return values should be checked
Implement methods on MyModel to make sure that the important_A and important_B fields can never get out of sync with long_name (if the relationship is permanent rather than as-of-now, that is). This could be the save method, or properties with getters and setters to cross-reference the fields.
Related
I have currently worked abit with ORM using Peewee and I have been trying to understand how I am able to get the field url from the table. The condition is that column visible needs to be true as well. Meaning that if visible is True and the store_id is 4 then return all the url as set.
I have currently done something like this
from peewee import (
Model,
TextField,
BooleanField
)
from playhouse.pool import PooledPostgresqlDatabase
# -------------------------------------------------------------------------
# Connection to Postgresql
# -------------------------------------------------------------------------
postgres_pool = PooledPostgresqlDatabase(
'xxxxxxx',
host='xxxxxxxx',
user='xxxxxxxx',
password='xxxxxx',
max_connections=20,
stale_timeout=30,
)
# ------------------------------------------------------------------------------- #
class Products(Model):
store_id = TextField(column_name='store_id')
url = TextField(column_name='url')
visible = BooleanField(column_name='visible')
class Meta:
database = postgres_pool
db_table = "develop"
#classmethod
def get_urls(cls):
try:
return set([i.url for i in cls.select().where((cls.store_id == 4) & (cls.visible))])
except Products.IntegrityError:
return None
However using the method takes around 0.13s which feels abit too long for me than what it supposed to do which I believe is due to the for loop and needing to put it as a set() and I wonder if there is a possibility that peewee can do something like cls.select(cls.url).where((cls.store_id == 4) & (cls.visible) and return as set()?
How many products do you have? How big is this set? Why not use distinct() so that the database de-duplicates them for you? What indexes do you have? All of these questions are much more pertinent than "how do I make this python loop faster".
I'd suggest that you need an index on store_id, visible or store_id where visible.
create index "product_urls" on "products" ("store_id") where "visible"
You could even use a covering index but this may take up a lot of disk space:
create index "product_urls" on "products" ("store_id", "url") where visible
Once you've got the actual query sped up with an index, you can also use distinct() to make the db de-dupe the URLs before sending them to Python. Additionally, since you only need the URL, just select that column and use the tuples() method to avoid creating a class:
#classmethod
def get_urls(cls):
query = cls.select(cls.url).where((cls.store_id == 4) & cls.visible)
return set(url for url, in query.distinct().tuples())
Lastly please read the docs: http://docs.peewee-orm.com/en/latest/peewee/querying.html#iterating-over-large-result-sets
Context
I am quite new to Django and I am trying to write a complex query that I think would be easily writable in raw SQL, but for which I am struggling using the ORM.
Models
I have several models named SignalValue, SignalCategory, SignalSubcategory, SignalType, SignalSubtype that have the same structure like the following model:
class MyModel(models.Model):
id = models.BigAutoField(primary_key=True)
name = models.CharField()
fullname = models.CharField()
I also have explicit models that represent the relationships between the model SignalValue and the other models SignalCategory, SignalSubcategory, SignalType, SignalSubtype. Each of these relationships are named SignalValueCategory, SignalValueSubcategory, SignalValueType, SignalValueSubtype respectively. Below is the SignalValueCategory model as an example:
class SignalValueCategory(models.Model):
signal_value = models.OneToOneField(SignalValue)
signal_category = models.ForeignKey(SignalCategory)
Finally, I also have the two following models. ResultSignal stores all the signals related to the model Result:
class Result(models.Model):
pass
class ResultSignal(models.Model):
id = models.BigAutoField(primary_key=True)
result = models.ForeignKey(
Result
)
signal_value = models.ForeignKey(
SignalValue
)
Query
What I am trying to achieve is the following.
For a given Result, I want to retrieve all the ResultSignals that belong to it, filter them to keep the ones of my interest, and annotate them with two fields that we will call filter_group_id and filter_group_name. The values of two fields are determined by the SignalValue of the given ResultSignal.
From my perspective, the easiest way to achieve this would be first to annotate the SignalValues with their corresponding filter_group_name and filter_group_id, and then to join the resulting QuerySet with the ResultSignals. However, I think that it is not possible to join two QuerySets together in Django. Consequently, I thought that we could maybe use Prefetch objects to achieve what I am trying to do, but it seems that I am unable to make it work properly.
Code
I will now describe the current state of my queries.
First, annotating the SignalValues with their corresponding filter_group_name and filter_group_id. Note that filter_aggregator in the following code is just a complex filter that allows me to select the wanted SignalValues only. group_filter is the same filter but as a list of subfilters. Additionally, filter_name_case is a conditional expression (Case() construct):
# Attribute a group_filter_id and group_filter_name for each signal
signal_filters = SignalValue.objects.filter(
filter_aggregator
).annotate(
filter_group_id=Window(
expression=DenseRank(),
order_by=group_filters
),
filter_group_name=filter_name_case
)
Then, trying to join/annotate the SignalResults:
prefetch_object = Prefetch(
lookup="signal_value",
queryset=signal_filters,
to_attr="test"
)
result_signals: QuerySet = (
last_interview_result
.resultsignal_set
.filter(signal_value__in=signal_values_of_interest)
.select_related(
'signal_value__signalvaluecategory__signal_category',
'signal_value__signalvaluesubcategory__signal_subcategory',
'signal_value__signalvaluetype__signal_type',
'signal_value__signalvaluesubtype__signal_subtype',
)
.prefetch_related(
prefetch_object
)
.values(
"signal_value",
"test",
category=F('signal_value__signalvaluecategory__signal_category__name'),
subcategory=F('signal_value__signalvaluesubcategory__signal_subcategory__name'),
type=F('signal_value__signalvaluetype__signal_type__name'),
subtype=F('signal_value__signalvaluesubtype__signal_subtype__name'),
)
)
Normally, from my understanding, the resulting QuerySet should have a field "test" that is now available, that would contain the fields of signal_filter, the first QuerySet. However, Django complains that "test" is not found when calling .values(...) in the last part of my code: Cannot resolve keyword 'test' into field. Choices are: [...]. It is like the to_attr parameter of the Prefetch object was not taken into account at all.
Questions
Did I missunderstand the functioning of annotate() and prefetch_related() functions? If not, what am I doing wrong in my code for the specified parameter to_attr to not exist in my resulting QuerySet?
Is there a better way to join two QuerySets in Django or am I better off using RawSQL? An alternative way would be to switch to Pandas to make the join in-memory, but it is very often more efficient to do such transformations on the SQL side with well-designed queries.
You're on the right path, but just missing what prefetch does.
Your annotations are correct, but the "test" prefetch isn't really an attribute. You batch up the SELECT * FROM signal_value queries so you don't have to execute the select per row. Just drop the "test" annotation and you should be fine. https://docs.djangoproject.com/en/3.2/ref/models/querysets/#prefetch-related
Please don't use pandas, it's definitely not necessary and is a ton of overhead. As you say yourself, it's more efficient to do the transforms on the sql side
From the docs on prefetch_related:
Remember that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously cached results, and retrieve data using a fresh database query.
It's not obvious but the values() call is part of these chained methods that imply a different query, and will actually cancel prefetch_related. This should work if you remove it.
I have a model Student with manager StudentManager as given below. As property gives the last date by adding college_duration in join_date. But when I execute this property computation is working well, but for StudentManager it gives an error. How to write manager class which on the fly computes some field using model fields and which is used to filter records.
The computed field is not in model fields. still, I want that as filter criteria.
class StudentManager(models.Manager):
def passed_students(self):
return self.filter(college_end_date__lt=timezone.now())
class Student(models.Model):
join_date = models.DateTimeField(auto_now_add=True)
college_duration = models.IntegerField(default=4)
objects = StudentManager()
#property
def college_end_date(self):
last_date = self.join_date + timezone.timedelta(days=self.college_duration)
return last_date
Error Django gives. when I tried to access Student.objects.passed_students()
django.core.exceptions.FieldError: Cannot resolve keyword 'college_end_date' into field. Choices are: join_date, college_duration
Q 1. How alias queries done in Django ORM?
By using the annotate(...)--(Django Doc) or alias(...) (New in Django 3.2) if you're using the value only as a filter.
Q 2. Why property not accessed in Django managers?
Because the model managers (more accurately, the QuerySet s) are wrapping things that are being done in the database. You can call the model managers as a high-level database wrapper too.
But, the property college_end_date is only defined in your model class and the database is not aware of it, and hence the error.
Q 3. How to write manager to filter records based on the field which is not in models, but can be calculated using fields present in the model?
Using annotate(...) method is the proper Django way of doing so. As a side note, a complex property logic may not be re-create with the annotate(...) method.
In your case, I would change college_duration field from IntegerField(...) to DurationField(...)--(Django Doc) since its make more sense (to me)
Later, update your manager and the properties as,
from django.db import models
from django.utils import timezone
class StudentManager(models.Manager):
<b>def passed_students(self):
default_qs = self.get_queryset()
college_end = models.ExpressionWrapper(
models.F('join_date') + models.F('college_duration'),
output_field=models.DateField()
)
return default_qs \
.annotate(college_end=college_end) \
.filter(college_end__lt=timezone.now().date())</b>
class Student(models.Model):
join_date = models.DateTimeField()
college_duration = models.DurationField()
objects = StudentManager()
#property
def college_end_date(self):
# return date by summing the datetime and timedelta objects
return <b>(self.join_date + self.college_duration).date()
Note:
DurationField(...) will work as expected in PostgreSQL and this implementation will work as-is in PSQL. You may have problems if you are using any other databases, if so, you may need to have a "database function" which operates over the datetime and duration datasets corresponding to your specific database.
Personally, I like this solution,
To quote #Willem Van Olsem's comment:
You don't. The database does not know anything about properties, etc. So it can not filter on this. You can make use of .annotate(..) to move the logic to the database side.
You can either do the message he shared, or make that a model field that auto calculates.
class StudentManager(models.Manager):
def passed_students(self):
return self.filter(college_end_date__lt=timezone.now())
class Student(models.Model):
join_date = models.DateTimeField(auto_now_add=True)
college_duration = models.IntegerField(default=4)
college_end_date = models.DateTimeField()
objects = StudentManager()
def save(self, *args, **kwargs):
# Add logic here
if not self.college_end_date:
self.college_end_date = self.join_date + timezone.timedelta(days-self.college_duration)
return super.save(*args, **kwargs)
Now you can search it in the database.
NOTE: This sort of thing is best to do from the start on data you KNOW you're going to want to filter. If you have pre-existing data, you'll need to re-save all existing instances.
Problem
You’re attempting to query on a row that doesn’t exist in the database. Also, Django ORM doesn’t recognize a property as a field to register.
Solution
The direct answer to your question would be to create annotations, which could be subsequently queried off of. However, I would reconsider your table design for Student as it introduces unnecessary complexity and maintenance overhead.
There’s much more framework/db support for start date, end date idiosyncrasy than there is start date, timedelta.
Instead of storing duration, store end_date and calculate duration in a model method. This makes more not only makes more sense as students are generally provided a start date and estimated graduation date rather than duration, but also because it’ll make queries like these much easier.
Example
Querying which students are graduating in 2020.
Students.objects.filter(end_date__year=2020)
class Model1(models.Model):
username = models.CharField(max_length=100,null=False,blank=False,unique=True)
password = models.CharField(max_length=100,null=False,blank=False)
class Model2(models.Model):
name = models.ForeignKey(Model1, null=True)
unique_str = models.CharField(max_length=50,null=False,blank=False,unique=True)
city = models.CharField(max_length=100,null=False,blank=False)
class Meta:
unique_together = (('name', 'unique_str'),)
I've already filled 3 sample username-password in Model1 through django-admin page
In my views I'm getting this list as
userlist = Model1.objects.all()
#print userlist[0].username, userlist[0].password
for user in userlist:
#here I want to get or create model2 object by uniqueness defined in meta class.
#I mean unique_str can belong to multiple user so I'm making name and str together as a unique key but I dont know how to use it here with get_or_create method.
#right now (without using unique_together) I'm doing this (but I dont know if this by default include unique_together functionality )
a,b = Model2.objects.get_or_create(unique_str='f3h6y67')
a.name = user
a.city = "acity"
a.save()
What I think you're saying is that your logical key is a combination of name and unique_together, and that you what to use that as the basis for calls to get_or_create().
First, understand the unique_together creates a database constraint. There's no way to use it, and Django doesn't do anything special with this information.
Also, at this time Django cannot use composite natural primary keys, so your models by default will have an auto-incrementing integer primary key. But you can still use name and unique_str as a key.
Looking at your code, it seems you want to do this:
a, _ = Model2.objects.get_or_create(unique_str='f3h6y67',
name=user.username)
a.city = 'acity'
a.save()
On Django 1.7 you can use update_or_create():
a, _ = Model2.objects.update_or_create(unique_str='f3h6y67',
name=user.username,
defaults={'city': 'acity'})
In either case, the key point is that the keyword arguments to _or_create are used for looking up the object, and defaults is used to provide additional data in the case of a create or update. See the documentation.
In sum, to "use" the unique_together constraint you simply use the two fields together whenever you want to uniquely specify an instance.
Suppose I have three django models:
class Section(models.Model):
name = models.CharField()
class Size(models.Model):
section = models.ForeignKey(Section)
size = models.IntegerField()
class Obj(models.Model):
name = models.CharField()
sizes = models.ManyToManyField(Size)
I would like to import a large amount of Obj data where many of the sizes fields will be identical. However, since Obj has a ManyToMany field, I can't just test for existence like I normally would. I would like to be able to do something like this:
try:
x = Obj(name='foo')
x.sizes.add(sizemodel1) # these can be looked up with get_or_create
...
x.sizes.add(sizemodelN) # these can be looked up with get_or_create
# Now test whether x already exists, so I don't add a duplicate
try:
Obj.objects.get(x)
except Obj.DoesNotExist:
x.save()
However, I'm not aware of a way to get an object this way, you have to just pass in keyword parameters, which don't work for ManyToManyFields.
Is there any good way I can do this? The only idea I've had is to build up a set of Q objects to pass to get:
myq = myq & Q(sizes__id=sizemodelN.id)
But I am not sure this will even work...
Use a through model and then .get() against that.
http://docs.djangoproject.com/en/dev/topics/db/models/#extra-fields-on-many-to-many-relationships
Once you have a through model, you can .get() or .filter() or .exists() to determine the existence of an object that you might otherwise want to create. Note that .get() is really intended for columns where unique is enforced by the DB - you might have better performance with .exists() for your purposes.
If this is too radical or inconvenient a solution, you can also just grab the ManyRelatedManager and iterate through to determine if the object exists:
object_sizes = obj.sizes.all()
exists = object_sizes.filter(id__in = some_bunch_of_size_object_ids_you_are_curious_about).exists()
if not exists:
(your creation code here)
Your example doesn't make much sense because you can't add m2m relationships before an x is saved, but it illustrated what you are trying to do pretty well. You have a list of Size objects created via get_or_create(), and want to create an Obj if no duplicate obj-size relationship exists?
Unfortunately, this is not possible very easily. Chaining Q(id=F) & Q(id=O) & Q(id=O) doesn't work for m2m.
You could certainly use Obj.objects.filter(size__in=Sizes) but that means you'd get a match for an Obj with 1 size in a huge list of sizes.
Check out this post for an __in exact question, answered by Malcolm, so I trust it quite a bit.
I wrote some python for fun that could take care of this.
This is a one time import right?
def has_exact_m2m_match(match_list):
"""
Get exact Obj m2m match
"""
if isinstance(match_list, QuerySet):
match_list = [x.id for x in match_list]
results = {}
match = set(match_list)
for obj, size in \
Obj.sizes.through.objects.filter(size__in=match).values_list('obj', 'size'):
# note: we are accessing the auto generated through model for the sizes m2m
try:
results[obj].append(size)
except KeyError:
results[obj] = [size]
return bool(filter(lambda x: set(x) == match, results.values()))
# filter any specific objects that have the exact same size IDs
# if there is a match, it means an Obj exists with exactly
# the sizes you provided to the function, no more.
sizes = [size1, size2, size3, sizeN...]
if has_exact_m2m_match(sizes):
x = Obj.objects.create(name=foo) # saves so you can use x.sizes.add
x.sizes.add(sizes)