I recently started using Pony ORM, and I think it's awesome. Even though the API is well documented on the official website, I'm having a hard time working with relationships. In particular, I want to insert a new entity that is part of a set, but I cannot seem to find a way to create the entity without fetching the related object first:
post = Post(
    # author = author_id,          # Python complains about author's type, which is int but should be User
    author = User[author_id],      # this works, but as I understand it, it actually fetches the User object
    # author = User(id=author_id), # this would try (and fail) to create a new record in the User table on commit, am I right?
    # ...
)
In the end only the id value is inserted in the table, so why should I fetch the entire object when I only need its id?
EDIT
I had a quick peek at the Pony ORM source code, and using the primary key of the reverse entity should work. But even in that case we end up calling _get_by_raw_pkval_, which fetches the object either from the local cache or from the database, so it's probably not possible.
It is part of the internal API and not the way Pony assumes you will use it, but you can actually use author = User._get_by_raw_pkval_((author_id,)) if you are sure that objects with those ids exist.
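For illustration, here is a minimal sketch of that workaround in context, reusing the Post and User entities from the question (_get_by_raw_pkval_ is internal API, so treat it as a fragile shortcut that may change between Pony releases):

from pony.orm import db_session, commit

with db_session:
    post = Post(
        # internal API: looks up/constructs the User by its raw primary key tuple
        author=User._get_by_raw_pkval_((author_id,)),
        # ... other Post attributes ...
    )
    commit()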
Related
I've been trying to build a tutorial system like the ones we usually see on websites, where you click next -> next -> previous etc. to read.
All posts are stored in a table (model) called Post, basically a pool of post objects.
Post.objects.all() will return all the posts.
Now there's another table (model) called Tutorial that will store the following:
class Tutorial(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    tutorial_heading = models.CharField(max_length=100)
    tutorial_summary = models.CharField(max_length=300)
    series = models.CharField(max_length=40)  # <---- Here [10,11,12]
    ...
Here, the entries in this series field are post_ids stored as a string representation of a list.
For example, series will have [10,11,12], where 10, 11 and 12 are post_ids that correspond to their respective entries in the Post table.
So my table entry for the Tutorial model looks like this:
id    heading               summary                       series
"5"   "Series 3 Tutorial"   "lorem on ullt consequat."    "[12, 13, 14]"
So I just read the series field, get all the Posts with the ids in this list, and then display them using pagination in Django.
Now, I've read in several Stack Overflow posts that having multiple entries in a single field is a bad idea, and that spanning this relationship over multiple tables as a mapping is a better option.
What I want is the ability to insert new posts into this series anywhere I want, maybe at the front or in the middle. This can easily be accomplished by treating this series as a list and inserting as I please. Altering it to "[14,12,13]" will reorder the posts that are being displayed.
My question is: is this way of storing multiple values in a field okay for my use case, or will it take a performance hit, or is it generally a bad idea? If it is, is there a way to preserve or alter the order while spanning the relationship across another table, or is there an entirely better way to accomplish this in Django or MySQL?
Here entries in this series field are post_ids stored as a string representation of a list.
(...)
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
DON'T DO THIS!!!
You are working with a relational database. There is one proper way to model relationships between entities in a relational database: foreign keys. In your case, depending on whether a post can belong to only a single tutorial (a "one to many" relationship) or to many tutorials at the same time (a "many to many" relationship), you'll want either to add to Post a foreign key to Tutorial, or to use an intermediate "post_tutorials" table with foreign keys to both Post and Tutorial.
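As a minimal sketch of the one-to-many variant (assuming a post belongs to at most one tutorial; the field names here are illustrative, not taken from your models):

from django.db import models

class Tutorial(models.Model):
    heading = models.CharField(max_length=100)

class Post(models.Model):
    tutorial = models.ForeignKey(
        Tutorial,
        null=True, blank=True,
        on_delete=models.SET_NULL,   # the database now decides what happens on delete
        related_name='posts',
    )
    order = models.PositiveSmallIntegerField(default=0)  # see the note about ordering below

    class Meta:
        ordering = ['order']

With this, tutorial.posts.all() returns a tutorial's posts in order with a single query, and the database enforces the relationship for you.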
Your solution doesn't allow the database to do its job properly. It cannot enforce integrity constraints (what if you delete a post that's referenced by a tutorial?), it cannot optimize read access (with a proper schema the database can retrieve a tutorial and all its posts in a single query), it cannot follow reverse relationships (given a post, access the tutorial(s) it belongs to), etc. And it requires an external program (Python code) to interact with your data, while with proper modeling you just need standard SQL.
Finally - but this is Django-specific - using a proper schema works better with the admin features, and with Django REST framework if you intend to build a REST API.
As for the ordering problem, it's a long-known (and solved) issue: you just need to add an "order" field (a small integer should be enough). There are a couple of third-party Django apps that add support for this to both your models and the admin, so it's almost plug and play.
IOW, there is absolutely no good reason to denormalize your schema this way and only good reasons to use proper relational modeling. FWIW I once had to work on a project based on some obscure (and hopefully long dead) PHP CMS that had the brilliant idea of using your "serialized lists" anti-pattern, and I can tell you it was both a disaster for performance and a complete nightmare to maintain. So do yourself and the world a favour: don't try to be creative, follow well-known and established best practices instead, and your life will be much happier. My 2 cents...
I can think of two approaches:
Approach One: Linked List
One way is to use a linked list, like this:
class Tutorial(models.Model):
    ...
    previous = models.OneToOneField('self', null=True, blank=True,
                                    on_delete=models.SET_NULL,  # required in Django 2.0+
                                    related_name="next")
In this approach, you can walk forward through each series like this:
for tutorial in Tutorial.objects.filter(previous__isnull=True):
    print(tutorial)
    # `next` is the reverse OneToOne accessor; it raises DoesNotExist when
    # there is no following tutorial, so hasattr() is a handy way to test for it
    while hasattr(tutorial, 'next'):
        tutorial = tutorial.next
        print(tutorial)
This is a somewhat complicated approach; for example, whenever you want to add a new tutorial in the middle of a linked list, you need to change two places. Like:
post = Tutorial.objects.first()
next_post = getattr(post, 'next', None)            # the tutorial currently following `post`, if any
new = Tutorial.objects.create(..., previous=post)  # link the new tutorial after `post`
if next_post:
    next_post.previous = new                       # re-point the old follower at the new tutorial
    next_post.save()
But there is a big benefit to this approach: you don't have to create a new table for the series. Also, the order of tutorials will probably not be modified frequently, so you don't need to take on too much hassle.
Approach Two: Create a new Model
You can simply create a new model and add a FK to it from Tutorial, like this:
class Series(models.Model):
    name = models.CharField(max_length=255)


class Tutorial(models.Model):
    ...
    series = models.ForeignKey(Series, null=True, blank=True,
                               on_delete=models.SET_NULL,  # required in Django 2.0+
                               related_name='tutorials')
    order = models.IntegerField(default=0)

    class Meta:
        unique_together = ('series', 'order')  # makes sure duplicate orders within the same series cannot happen
Then you can access the tutorials of a series, in order, by:
series = Series.objects.first()
series.tutorials.order_by('order')
The advantage of this approach is that it's much more flexible to access Tutorials through a Series, but it costs an extra table, plus one extra field to maintain the order.
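For what it's worth, here is a rough sketch (not part of the original answer) of inserting a tutorial at an arbitrary position with Approach Two; the insert_tutorial helper is hypothetical:

from django.db import transaction
from django.db.models import F

def insert_tutorial(series, position, **fields):
    with transaction.atomic():
        # Shift from the highest order downwards so the unique_together
        # ('series', 'order') constraint is never violated mid-update.
        for t in series.tutorials.filter(order__gte=position).order_by('-order'):
            t.order = F('order') + 1
            t.save(update_fields=['order'])
        return Tutorial.objects.create(series=series, order=position, **fields)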
I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, then loaded them in the new instance, looping over the objects and saving each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
with connection.cursor() as cur:
    cur.execute(
        "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
    )
About the debate: my case is a one-time migration, and my decision was to run this function right after I finished each table's migration, although you could call it anytime you suspect the sequences could be out of sync.
from django.db import connections


def synchronize_last_sequence(model):
    # PostgreSQL auto-increment counters (sequences) don't advance when rows
    # are inserted with explicit IDs. This sets the sequence to the current
    # max primary key value so the next INSERT gets a fresh id.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
I did this using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
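For reference, a hypothetical way to run it over every model of an app after the migration (the 'myapp' label is a placeholder):

from django.apps import apps

for model in apps.get_app_config('myapp').get_models():  # 'myapp' is a placeholder app label
    synchronize_last_sequence(model)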
One option is to use natural keys during serialization and deserialization. That way, when you insert the data into PostgreSQL, it will auto-increment the primary key field and keep everything in line.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
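As a minimal sketch of what natural keys look like, assuming a model with a unique username field (all names here are illustrative):

from django.db import models

class PersonManager(models.Manager):
    def get_by_natural_key(self, username):
        return self.get(username=username)

class Person(models.Model):
    username = models.CharField(max_length=100, unique=True)

    objects = PersonManager()

    def natural_key(self):
        return (self.username,)

You would then dump with python manage.py dumpdata --natural-foreign --natural-primary so the fixtures reference related objects by natural key instead of by id.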
I have a Django app that has a model (Person) defined, and I also have a DB (containing a table Appointment) that does not have any models defined (it is not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person so that the Person table mirrors the Appointment table. It is this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall()  # from the Appointment table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: I cannot use this because I sometimes have to update.
2. Delete all and then bulk_create: other things depend on the Person model, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: I cannot do this because the Person table's PK is different from the Appointment table's PK (primary key). The Appointment PK is copied into the Person table. If there were a way to check on two duplicate keys, this approach would work, I think.
A few ideas:
EDIT: See Trewq's comment on this and, first of all, create indexes on your tables…
Wrap it all in a transaction using with transaction.atomic():, since by default Django creates a new transaction per save() call, which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (and it should not be with 20k records), fetch all data from the Appointment table first, and then fetch all existing Person objects using a single SELECT query instead of one per record.
Use bulk_create even if some of the records are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly (see the sketch after this list).
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this, a UNIQUE key would suffice. Keys can span multiple columns; see Django's Meta.unique_together model option.
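Here is a rough sketch combining the first three ideas (a single SELECT for the existing rows, one transaction, bulk_create for the new ones), reusing sourcedb, myoffice and the column positions from the question:

from django.db import transaction

res = sourcedb.fetchall()  # from the Appointment table, as in the question

# one SELECT for all existing Person rows of this office
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

to_create = []
with transaction.atomic():
    for myrecord in res:
        person = existing.get(myrecord[12])
        if person is None:
            person = Person(vendorid=myrecord[12], office=myoffice)
        person.firstname = myrecord[0]
        person.midname = myrecord[1]
        person.lastname = myrecord[2]
        person.address1 = myrecord[3]
        person.address2 = myrecord[4]
        if person.pk is None:
            to_create.append(person)  # new rows: collect for a single bulk INSERT
        else:
            person.save()             # existing rows: still one UPDATE each
    Person.objects.bulk_create(to_create)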
I apologize if my question turns out to be silly, but I'm rather new to Django, and I could not find an answer anywhere.
I have the following model:
class BlackListEntry(models.Model):
    user_banned = models.ForeignKey(auth.models.User, on_delete=models.CASCADE, related_name="user_banned")
    user_banning = models.ForeignKey(auth.models.User, on_delete=models.CASCADE, related_name="user_banning")
Now, when I try to create an object like this:
BlackListEntry.objects.create(user_banned=int(user_id),user_banning=int(banning_id))
I get the following error:
Cannot assign "1": "BlackListEntry.user_banned" must be a "User" instance.
Of course, if I replace it with something like this:
user_banned = User.objects.get(pk=user_id)
user_banning = User.objects.get(pk=banning_id)
BlackListEntry.objects.create(user_banned=user_banned,user_banning=user_banning)
everything works fine. The question is:
Does my solution hit the database to retrieve both users, and if so, is it possible to avoid it by just passing ids?
The answer to your question is: YES.
Django will hit the database (at least) three times: twice to retrieve the two User objects and a third time to save your desired information. This causes absolutely unnecessary overhead.
Just try:
BlackListEntry.objects.create(user_banned_id=int(user_id),user_banning_id=int(banning_id))
This is the default naming pattern for the FK columns generated by the Django ORM. This way you can set the information directly and avoid the extra queries.
If you wanted to query for the already saved BlackListEntry objects, you can navigate the attributes with a double underscore, like this:
BlackListEntry.objects.filter(user_banned__id=int(user_id),user_banning__id=int(banning_id))
This is how you access fields in Django querysets: with a double underscore. Then you can compare against the value of the attribute.
Though they look very similar, the two work completely differently. The first one sets an attribute directly, while the second one is parsed by Django, which splits it at the '__' and queries the database the right way, the second part being the name of an attribute.
You can always compare user_banned and user_banning with the actual User objects instead of their ids, but there is no point in doing that if you don't already have those objects at hand.
Hope it helps.
I do believe that when you fetch the users, it is going to hit the db...
To avoid it, you would have to write raw SQL to do the insert, using the method described here:
https://docs.djangoproject.com/en/dev/topics/db/sql/
If you decide to go that route, keep in mind that you are responsible for protecting yourself from SQL injection attacks.
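For example, a rough sketch using query parameters so the driver does the escaping for you (the table name follows Django's default appname_modelname pattern and is an assumption here):

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "INSERT INTO myapp_blacklistentry (user_banned_id, user_banning_id) "
        "VALUES (%s, %s)",
        [int(user_id), int(banning_id)],
    )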
Another alternative would be to cache the user_banned and user_banning objects.
But in all likelihood, simply grabbing the users and creating the BlackListEntry won't cause you any noticeable performance problems. Caching or executing raw sql will only provide a small benefit. You're probably going to run into other issues before this becomes a problem.
I have an SQLAlchemy ORM class, linked to MySQL, which works great at saving the data I need down to the underlying table. However, I would like to also save the identical data to a second archive table.
Here's some pseudocode to try to explain what I mean:
my_data = Data() #An ORM Class
my_data.name = "foo"
#This saves just to the 'data' table
session.add(my_data)
#This will save it to the identical 'backup_data' table
my_data_archive = my_data
my_data_archive.__tablename__ = 'backup_data'
session.add(my_data_archive)
#And commits them both
session.commit()
Just a heads up, I am not interested in mapping a class to a JOIN, as in: http://www.sqlalchemy.org/docs/05/mappers.html#mapping-a-class-against-multiple-tables
I list some options below. I would go for the DB trigger if you do not need to work on those objects in your model.
use a database trigger to do this job for you
create a SessionExtension which creates the copy-objects and adds them to the session (usually in before_flush). Edit-1: you can take the versioning example from SA as a basis; that code does even more than you need.
see the SA versioning example, which will give you not only a copy of the object but the whole version history, which might be what you wish for
see the Reverse mapping from a table to a model in SQLAlchemy question, where the proposed solution is described in a blog post.
Create two identical models: one mapped to the main table and another mapped to the archive table. Create a MapperExtension with a redefined after_insert() method (depending on your needs you might also want after_update() and after_delete()). This method should copy the data from the main model to the archive model and add it to the session. There are some tricks to copy all columns and many-to-many relations automagically.
Note that you'll have to flush() the session twice to store both objects, since the unit of work is computed before the mapper extension adds the new object to the session. You can redefine Session.flush() to take care of this problem. Also, auto-incremented fields are assigned when the object is flushed, so you'll have to delay copying if you need them too.
This is one possible scenario that is proven to work. I'd like to know if there is a better way.
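For readers on newer SQLAlchemy versions, here is a rough sketch of the same copy-on-insert idea using the event API (the successor to MapperExtension); the models and columns are made up for illustration:

from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Data(Base):
    __tablename__ = 'data'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

class BackupData(Base):
    __tablename__ = 'backup_data'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

@event.listens_for(Data, 'after_insert')
def copy_to_archive(mapper, connection, target):
    # write the archive row through the same connection so it joins the
    # ongoing flush/transaction instead of touching the Session mid-flush
    connection.execute(
        BackupData.__table__.insert().values(id=target.id, name=target.name)
    )

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Data(name='foo'))
    session.commit()  # inserts into both 'data' and 'backup_data'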