Best way to merge and unmerge objects without losing data - python

Say I have two tables (I am using Django, but this question is mostly language agnostic):
class Organization(models.Model):
    name = models.CharField(max_length=100)

class Event(models.Model):
    organization = models.ForeignKey(Organization)
    name = models.CharField(max_length=100)
Users are allowed to create both events and organizations. There is the chance that two separate users create organization objects that are supposed to resemble the same real world organization. When someone notices this problem, they should be able to merge the two objects so there is only one organization.
The question I have is this: How do I merge these two organizations in order to ensure I can "unmerge" them if the user incorrectly merged them? Thus, the simple solution of deleting one of the Organization objects and pointing all Events to the other one is not an option. I am looking for very high level guidelines on best practices here.
A few possible solutions:
Add another table that joins together organizations that have been "merged" and keep track of merges that way
Add a foreign key field on Organization to point to an organization it was merged with
Keep copies of all of the original objects as they existed before a merge, using something like django-reversion

Personally, I would go with a solution which uses something like django-reversion. However, if you want to create something more robust and less dependent on 3rd party logic, add a merged_into field to Organization and merged_from field to Event:
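class Organization(models.Model):
    name = models.CharField(max_length=100)
    # Set when this organization has been merged into another one.
    merged_into = models.ForeignKey('self', null=True, blank=True)

class Event(models.Model):
    # Distinct related_name values avoid a reverse-accessor clash between the two foreign keys.
    organization = models.ForeignKey(Organization, related_name='events')
    name = models.CharField(max_length=100)
    # The organization this event belonged to before the merge.
    merged_from = models.ForeignKey(Organization, null=True, blank=True, related_name='merged_events')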
On merge, you can choose to update the events as well. From then on, be sure to redirect all references to organizations that have "merged_into" set to the organization they were merged into.
If you want to allow multiple merges (for example: A + B into C, A+C into D, E+F into G and D+G into H), you can create a new organization instance each time and merge both "parents" into it, copying the events instead of updating them. This keeps the original events intact waiting for a rollback. This also allows merging more than 2 organizations into a new one in one step.
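The merged_into / merged_from bookkeeping can be sketched without Django. This is only a rough illustration using plain dataclasses in place of the models; the merge() and unmerge() helpers are hypothetical names, and the sketch handles a single merge level, not the chained case described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Organization:
    id: int
    name: str
    merged_into: Optional[int] = None  # id of the organization this one was merged into


@dataclass
class Event:
    name: str
    organization: int                  # id of the current owner
    merged_from: Optional[int] = None  # id of the owner before the merge


def merge(source: Organization, target: Organization, events: List[Event]) -> None:
    """Merge source into target, remembering where each event came from."""
    source.merged_into = target.id
    for event in events:
        if event.organization == source.id:
            event.merged_from = source.id
            event.organization = target.id


def unmerge(source: Organization, events: List[Event]) -> None:
    """Undo a merge by sending every moved event back to its original owner."""
    for event in events:
        if event.merged_from == source.id:
            event.organization = source.id
            event.merged_from = None
    source.merged_into = None


acme = Organization(1, "Acme")
acme_dup = Organization(2, "ACME Inc.")
events = [Event("Launch", 1), Event("Meetup", 2)]

merge(acme_dup, acme, events)
owners_after_merge = [e.organization for e in events]

unmerge(acme_dup, events)
owners_after_unmerge = [e.organization for e in events]
```

Because nothing is deleted, the unmerge only has to read merged_from back; that is the property the question asks for.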

My suggestion would be a diff-like interface. For each field, you provide all the possible values from the objects being merged. The person merging them chooses the appropriate value for each field. You'd probably only want to show fields on which a conflict was detected in this view.
After a "good" value has been chosen for every conflicting field, you create a new object, assign relationships from the old versions to that one, and delete the old versions.
If you're looking for some sort of automatic approach, I think you'd be hard pressed to find one, and even if you did, it would not really be a good idea. Any time you're merging anything you need a human in the middle. Even apps that sync contacts and such don't attempt to handle conflicts on their own.
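As a rough sketch of the conflict-detection step behind such a diff-like view, assuming each object is available as a plain dict of field values (find_conflicts and apply_choices are hypothetical helper names, not part of any library):

```python
from typing import Any, Dict, List


def find_conflicts(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Return each field whose values differ, with the distinct candidates in order."""
    fields: List[str] = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)

    conflicts: Dict[str, List[Any]] = {}
    for name in fields:
        candidates: List[Any] = []
        for record in records:
            value = record.get(name)
            if value not in candidates:
                candidates.append(value)
        if len(candidates) > 1:  # only conflicting fields are shown to the human
            conflicts[name] = candidates
    return conflicts


def apply_choices(records: List[Dict[str, Any]], choices: Dict[str, Any]) -> Dict[str, Any]:
    """Build the merged record: agreed values plus the human's choice per conflict."""
    merged: Dict[str, Any] = {}
    for record in records:
        for name, value in record.items():
            merged.setdefault(name, value)
    merged.update(choices)
    return merged


a = {"name": "Acme", "city": "Berlin"}
b = {"name": "ACME Inc.", "city": "Berlin"}

conflicts = find_conflicts([a, b])
merged = apply_choices([a, b], {"name": "Acme"})
```

Here only "name" would be presented for a decision, since both objects agree on "city".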

I think there is a hack based on keys.
Organization will have the usual id field plus another 'aliases' field. The 'aliases' field holds comma-separated ids and tracks the organizations that may refer to the same real-world organization. Let's say there are two organizations named organization_1 and organization_2, with ids 1 and 2.
organization_1        organization_2
_id = 1               _id = 2
aliases = '1, 2'      aliases = '2, 1'
If you want to query the events that belong only to organization_1, you can still do that. If you want to query all events of organization_1 and organization_2, you check whether the aliases field contains the key. The separator should probably not only sit between the ids but also surround the whole aliases field, something like ',1,2,'. That way we can reliably check whether it contains ',id,'.
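A small sketch of why the surrounding separators matter, using plain strings (pack_aliases and has_alias are hypothetical helper names):

```python
def pack_aliases(ids):
    """Serialize ids as ',1,2,' so that every id is surrounded by separators."""
    return "," + ",".join(str(i) for i in ids) + ","


def has_alias(aliases, org_id):
    """True only if org_id is one of the packed ids, not a substring of another id."""
    return f",{org_id}," in aliases


naive = "11,2"                  # plain comma-separated ids
packed = pack_aliases([11, 2])  # ids wrapped in separators

# A plain substring test wrongly finds id 1 inside id 11.
naive_false_positive = "1" in naive
```

With the packed form, has_alias(packed, 1) is False while has_alias(packed, 11) is True. In a database the same test would be a substring match on ',id,', which still cannot use an ordinary index.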

Related

Storing multiple values into a single field in mysql database that preserve order in Django

I've been trying to build a tutorial system like the ones we usually see on websites, where you click next -> next -> previous etc. to read.
All posts are stored in a table (model) called Post, which is basically a pool of post objects.
Post.objects.all() will return all the posts.
Now there's another table (model) called Tutorial that will store the following:
class Tutorial(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    tutorial_heading = models.CharField(max_length=100)
    tutorial_summary = models.CharField(max_length=300)
    series = models.CharField(max_length=40)  # <---- Here [10,11,12]
    ...
Here entries in this series field are post_ids stored as a string representation of a list.
example: series will have [10,11,12] where 10, 11 and 12 are post_id that correspond to their respective entries in the Post table.
So my table entry for Tutorial model looks like this.
id     heading                summary                       series
"5"    "Series 3 Tutorial"    "lorem on ullt consequat."    "[12, 13, 14]"
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
Now, I've read in several Stack Overflow posts that storing multiple entries in a single field is a bad idea, and that spreading the relationship over multiple tables as a mapping is a better option.
What I want is the ability to insert new posts into this series anywhere I want, maybe at the front or in the middle. This can be easily accomplished by treating the series as a list and inserting as I please. Altering it to "[14,12,13]" will reorder the posts that are displayed.
My question is: is this way of storing multiple values in one field okay for my use case, or will it take a performance hit, or is it generally a bad idea? If it is a bad idea, is there a way to preserve or alter the order while spanning the relationship over another table, or is there an entirely better way to accomplish this in Django or MySQL?
Here entries in this series field are post_ids stored as a string representation of a list.
(...)
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
DON'T DO THIS !!!
You are working with a relational database. There is one proper way to model relationships between entities in a relational database, which is to use foreign keys. In your case, depending on whether a post can belong only to a single tutorial ("one to many" relationship) or to many tutorials at the same time ("many to many" relationship), you'll want either to add to post a foreign key on tutorial, or to use an intermediate "post_tutorials" table with foreign keys on both post and tutorial.
Your solution doesn't allow the database to do its job properly. It cannot enforce integrity constraints (what if you delete a post that's referenced by a tutorial?), it cannot optimize read access (with a proper schema the database can retrieve a tutorial and all its posts in a single query), it cannot follow reverse relationships (given a post, access the tutorial(s) it belongs to), etc. And it requires an external program (Python code) to interact with your data, while with proper modeling you just need standard SQL.
Finally - but this is django-specific - using a proper schema works better with the admin features, and with django rest framework if you intend to build a REST API.
wrt/ the ordering problem, it's a long-known (and solved) issue: you just need to add an "order" field (a small int should be enough). There are a couple of 3rd-party django apps that add support for this to both your models and the admin, so it's almost plug and play.
IOW, there is absolutely no good reason to denormalize your schema this way and only good reasons to use proper relational modeling. FWIW I once had to work on a project based on some obscure (and hopefully long dead) PHP cms that had the brilliant idea to use your "serialized lists" anti-pattern, and I can tell you it was both a disaster wrt/ performance and a complete nightmare to maintain. So do yourself and the world a favour: don't try to be creative, follow well-known and established best practices instead, and your life will be much happier. My 2 cents...
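To make the intermediate-table-plus-order-field idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Django or MySQL. The table and column names (tutorial_post, position) and the post titles are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tutorial (id INTEGER PRIMARY KEY, heading TEXT NOT NULL);
CREATE TABLE tutorial_post (
    tutorial_id INTEGER NOT NULL REFERENCES tutorial(id),
    post_id     INTEGER NOT NULL REFERENCES post(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (tutorial_id, post_id)
);
""")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [(12, "Intro"), (13, "Setup"), (14, "Deploy")])
conn.execute("INSERT INTO tutorial VALUES (5, 'Series 3 Tutorial')")
conn.executemany("INSERT INTO tutorial_post VALUES (5, ?, ?)",
                 [(12, 0), (13, 1), (14, 2)])


def ordered_titles(tutorial_id):
    """Fetch a tutorial's posts in display order with a single query."""
    rows = conn.execute(
        "SELECT post.title FROM tutorial_post "
        "JOIN post ON post.id = tutorial_post.post_id "
        "WHERE tutorial_post.tutorial_id = ? ORDER BY tutorial_post.position",
        (tutorial_id,),
    )
    return [title for (title,) in rows]


before = ordered_titles(5)

# Move post 14 to the front: shift every position down, then place it at 0.
conn.execute("UPDATE tutorial_post SET position = position + 1 WHERE tutorial_id = 5")
conn.execute("UPDATE tutorial_post SET position = 0 WHERE tutorial_id = 5 AND post_id = 14")
after = ordered_titles(5)

# The database now protects integrity: a referenced post cannot be deleted.
try:
    conn.execute("DELETE FROM post WHERE id = 12")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

Reordering becomes a pair of UPDATE statements instead of rewriting a serialized list, and the reverse lookup (which tutorials contain post 12) is an ordinary query on tutorial_post.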
I can think of two approaches:
Approach One: Linked List
One way is using linked list like this:
class Tutorial(models.Model):
    ...
    previous = models.OneToOneField('self', null=True, blank=True, on_delete=models.SET_NULL, related_name="next")
In this approach, you can walk through each series from its first tutorial like this:
for tutorial in Tutorial.objects.filter(previous__isnull=True):
    print(tutorial)
    # Accessing a missing reverse one-to-one raises an exception, so test with hasattr().
    while hasattr(tutorial, 'next'):
        tutorial = tutorial.next
        print(tutorial)
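This is a somewhat complicated approach. For example, whenever you want to add a new tutorial in the middle of a linked list, you need to make changes in two places, like:
post = Tutorial.objects.first()
next_post = getattr(post, 'next', None)  # the tutorial currently following post, if any
if next_post is not None:
    # Detach first: the one-to-one constraint allows only one tutorial per "previous".
    next_post.previous = None
    next_post.save()
new = Tutorial.objects.create(..., previous=post)  # "..." stands for the other field values
if next_post is not None:
    next_post.previous = new
    next_post.save()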
But there is a huge benefit in this approach, you don't have to create a new table for creating series. Also, there is possibility that the order in tutorials will not be modified frequently, which means you don't need to take too much hassle.
Approach Two: Create a new Model
You can simply create a new model and FK to Tutorial, like this:
class Series(models.Model):
    name = models.CharField(max_length=255)

class Tutorial(models.Model):
    ...
    series = models.ForeignKey(Series, null=True, blank=True, on_delete=models.SET_NULL, related_name='tutorials')
    order = models.IntegerField(default=0)

    class Meta:
        unique_together = ('series', 'order')  # makes sure the same order is not used twice within a series
Then you can access tutorials in series by:
series = Series.objects.first()
series.tutorials.all().order_by('order')
The advantage of this approach is that it's much more flexible to access Tutorials through a series, but an extra table is created for this, and one extra field as well to maintain the order.

To merge address or not to merge addresses?

I'm working on a Python Flask project with SQLAlchemy and PostgreSQL, and I'm not sure whether I should merge two of my tables. I started off with a company address table in the early development stages, and now I have another address table.
Should I keep these separate or should these be merged to a global Address table?
UserAddress
CompanyAddress
If I do merge them into one Address table, it would look something like the following. However, in the long run I will still have many more rows with a user_id than with a company_id, so company_id would mostly be blank.
Address
-user_id
-company_id
Companies and users can each have multiple addresses, which is why I'm considering this approach. But I'm not sure whether it is a good idea in the long run, especially for maintenance. Or should I leave the tables as they are and maintain them separately?
Thanks! And if possible, share your experience in dealing with similar situations?
It's pretty difficult to answer your question, there are some points to take care of:
If UserAddress and CompanyAddress have exactly the same fields, are interchangeable between Users and Companies, and you won't want to make queries like "all of the UserAddress" or "all of the CompanyAddress", my advice would be to merge them into the same Address table, as they represent the same entity for you in the database model.
If there are some user addresses that don't make sense as a company address, or if you will add values for company addresses that are not present in user addresses, I recommend keeping the tables separate, because they don't represent the same entity (a user address is not the same as a company address).
If you will run operations like "all of the addresses", or search for text inside all of the addresses, I think it'd make sense to use inheritance. You can accomplish this by saving the common fields in one Address table and the specific ones in separate user/company address tables that keep a foreign key to the main Address table; then you reference the User/Company Address tables from the User/Company tables respectively.
Your safe bet is to keep the tables separate; you can always merge them later if it makes sense. But if you feel really confident that the tables are (and will stay) exactly the same, and you won't be querying specifically for user or company addresses (which would require a lot of joins), just merge them into one table.
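The inheritance-style layout from the third point could look roughly like this. It is only a sketch using Python's built-in sqlite3 module instead of SQLAlchemy and PostgreSQL, and every column besides user_id and company_id (street, city, department) is a made-up assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Common fields live in one table.
CREATE TABLE address (
    id     INTEGER PRIMARY KEY,
    street TEXT NOT NULL,
    city   TEXT NOT NULL
);
-- Owner-specific tables keep a foreign key to the shared address row.
CREATE TABLE user_address (
    address_id INTEGER PRIMARY KEY REFERENCES address(id),
    user_id    INTEGER NOT NULL
);
CREATE TABLE company_address (
    address_id INTEGER PRIMARY KEY REFERENCES address(id),
    company_id INTEGER NOT NULL,
    department TEXT  -- hypothetical company-only field
);
""")
conn.executemany("INSERT INTO address VALUES (?, ?, ?)",
                 [(1, "1 Main St", "Springfield"), (2, "9 Dock Rd", "Shelbyville")])
conn.execute("INSERT INTO user_address VALUES (1, 42)")
conn.execute("INSERT INTO company_address VALUES (2, 7, 'Shipping')")

# "All of the addresses" needs no join at all.
all_cities = [city for (city,) in conn.execute("SELECT city FROM address ORDER BY id")]

# The addresses of one company need a single join.
company_streets = [street for (street,) in conn.execute(
    "SELECT address.street FROM company_address "
    "JOIN address ON address.id = company_address.address_id "
    "WHERE company_address.company_id = ?", (7,))]
```

No column is left mostly blank in this layout, at the cost of one join whenever the owner matters.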
Hope it helps,

How to model a unique constraint in GAE ndb

I want to have several "bundles" (Mjbundle), which essentially are bundles of questions (Mjquestion). The Mjquestion has an integer "index" property which needs to be unique, but it should only be unique within the bundle containing it. I'm not sure how to model something like this properly, I try to do it using a structured (repeating) property below, but there is yet nothing actually constraining the uniqueness of the Mjquestion indexes. What is a better/normal/correct way of doing this?
class Mjquestion(ndb.Model):
    """This is a Mjquestion."""
    index = ndb.IntegerProperty(indexed=True, required=True)
    genre1 = ndb.IntegerProperty(indexed=False, required=True, choices=[1,2,3,4,5,6,7])
    genre2 = ndb.IntegerProperty(indexed=False, required=True, choices=[1,2,3])
    #(will add a bunch of more data properties later)

class Mjbundle(ndb.Model):
    """This is a Mjbundle."""
    mjquestions = ndb.StructuredProperty(Mjquestion, repeated=True)
    time = ndb.DateTimeProperty(auto_now_add=True)
(With the above model and having fetched a certain Mjbundle entity, I am not sure how to quickly fetch a Mjquestion from mjquestions based on the index. The explanation on filtering on structured properties looks like it works on the Mjbundle type level, whereas I already have a Mjbundle entity and was not sure how to quickly query only on the questions contained by that entity, without looping through them all "manually" in code.)
So I'm open to any suggestion on how to do this better.
I read this informational answer: https://stackoverflow.com/a/3855751/129202 It gives some thoughts about scalability and on a related note I will be expecting just a couple of bundles but each bundle will have questions in the thousands.
Maybe I should not use the mjquestions property of Mjbundle at all, but rather focus on parenting: each Mjquestion created should have a certain Mjbundle entity as parent. And then "manually" enforce uniqueness at "insert time" by doing an ancestor query.
When you use a StructuredProperty, all of the entities of that type are stored as part of the containing entity, so when you fetch your bundle, you have already fetched all of the questions. If you stick with this way of storing things, iterating to check in code is the solution.
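Checking in code could look roughly like this. The sketch uses plain dataclasses as stand-ins for Mjquestion and Mjbundle so that it runs without ndb; add_question and by_index are hypothetical helper names, not ndb API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Question:  # stand-in for Mjquestion
    index: int
    genre1: int


@dataclass
class Bundle:  # stand-in for Mjbundle
    questions: List[Question] = field(default_factory=list)

    def add_question(self, question: Question) -> None:
        """Append the question, refusing an index already used in this bundle."""
        if any(q.index == question.index for q in self.questions):
            raise ValueError(f"index {question.index} already used in this bundle")
        self.questions.append(question)

    def by_index(self) -> Dict[int, Question]:
        """Build a lookup table once instead of scanning the list on every access."""
        return {q.index: q for q in self.questions}


bundle = Bundle()
bundle.add_question(Question(index=1, genre1=3))
bundle.add_question(Question(index=2, genre1=5))

try:
    bundle.add_question(Question(index=1, genre1=7))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

found = bundle.by_index()[2]
```

With thousands of questions per bundle, building the dict once per fetched bundle keeps repeated lookups cheap, though the uniqueness check itself stays a linear scan.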

Archiving Django models

I'm creating an online order system for selling items on a regular basis (home delivery of vegetable boxes). I have an 'order' model (simplified) as follows:
class BoxOrder(models.Model):
    customer = models.ForeignKey(Customer)
    frequency = models.IntegerField(choices=((1, "Weekly"), (2, "Fortnightly")))
    item = models.ForeignKey(Item)
    payment_method = models.IntegerField(choices=((1, "Online"), (2, "Free")))
Now my 'customer' has the ability to change the frequency of the order, or the 'item' (say 'carrots') being sold, or even delete the order altogether.
What I'd like to do is create weekly 'backups' of all orders processed that week, so that I can see a historical graph of all the orders ever sold every week. The problem with just archiving the order into another table/database is that if an item is deleted for some reason (say I no longer sell carrots), then that archived BoxOrder would become invalid because of the ForeignKeys.
What would be the best solution for creating an archiving system using Django - so that orders for every week in history are viewable in Django admin, and they are 'static' (i.e. independent of whether any other objects are deleted)?
I've thought about creating a new 'flat' BoxOrderArchive model, then using a cron job to move orders for a given week over, e.g.:
class BoxOrderArchive(models.Model):
    customer_name = models.CharField(max_length=20)
    frequency = models.IntegerField()
    item_name = models.CharField(max_length=100)  # refers to BoxOrder.item.name; CharField requires max_length
    item_price = models.DecimalField(max_digits=10, decimal_places=2)  # refers to BoxOrder.item.price
    payment_method = models.IntegerField()
But I feel like that might be a lot of extra work. Before I go down that route, it would be great to know if anybody has any other solutions?
Thanks
This is a rather broad topic, and I won't specifically answer your question; however, my advice to you is: don't delete or move anything. You can add a boolean field to your Item named is_deleted or is_active or something similar, and set that flag when you "delete" your item. This way you can
keep your ForeignKeys,
have a different representation for non-active items,
restore an Item that was previously deleted (for instance you may want to sell Carrots again after some months - this way your statistics will be consistent across the year).
The same advice is true for the BoxOrder model. Do not move rows to different tables, just add an is_archived field and set it to True.
So, after looking into this long and hard, I think the best solution for me is to create a 'flat' version of the object, dereferencing any existing objects, and save that in the database.
The reason for this is that my 'BoxOrder' object can change every week (as the customer edits their address, item, cost etc.). Keeping track of all these changes is just plain difficult.
Plus, I don't need to do anything with the data other than display it to the site's users.
Basically what I am wanting to do is to create a snapshot, and none of the existing tools really are what I want. Having said that, others may have different priorities, so here's a list of useful links:
[1] SO question regarding storing a snapshot/pickling model instances
[2] Django Simple History Docs - stores model state on every create/update/delete
[3] Django Reversion Docs - allows reverting a model instance
For discussion on [2] and [3], see the comments on Serafim's answer
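The 'flat' snapshot idea from this answer can be sketched in plain Python. The dataclasses below stand in for the Django models, and the snapshot() helper and its week argument are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Customer:
    name: str


@dataclass
class Item:
    name: str
    price: Decimal


@dataclass
class BoxOrder:
    customer: Customer
    item: Item
    frequency: int
    payment_method: int


def snapshot(order: BoxOrder, week: str) -> dict:
    """Copy every referenced value so the archive row depends on no other object."""
    return {
        "week": week,
        "customer_name": order.customer.name,
        "frequency": order.frequency,
        "item_name": order.item.name,
        "item_price": order.item.price,
        "payment_method": order.payment_method,
    }


carrots = Item("Carrots", Decimal("4.50"))
order = BoxOrder(Customer("Ann"), carrots, frequency=1, payment_method=1)
archived = snapshot(order, "2024-W01")

# Later edits to the live objects do not touch the snapshot.
carrots.price = Decimal("5.00")
order.item = Item("Potatoes", Decimal("3.00"))
```

The archived dict still says "Carrots" at 4.50 after the live order has changed, which is the 'static' behaviour asked for in the question.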

Python AppEngine Sort By Referenced Property

I have a model Entry
class Entry(db.Model):
    year = db.StringProperty()
    ...
and for whatever reason the last name field is stored in a different model LastName:
class LastName(db.Model):
    entry = db.ReferenceProperty(Entry, collection_name='last_names')
    last_name = db.StringProperty()
If I query Entry and sort it by year (or any other property) using .order() how would I then sort that by the last name? I'm new to python but coming from Java I would guess there's some kind of comparator equivalent; or I'm completely wrong and there's another way to do it. I for sure cannot change my model at this point in time, though that may be the solution later. Any suggestions?
EDIT: I'm currently paginating through the results using offsets (moving to cursors soon, but I think it would be the same issue). So if I try to sort outside of the datastore I would only be sorting the current set; it's possible that the first page will be all 'B's and the second page will have 'A's, so it will only be sorted by page not by overall set. Am I screwed the way my models are currently set up?
A few issues here.
There's no way to do this sorting directly in the datastore API, either in Python or Java - as you no doubt know, the datastore is non-relational, and indirect lookups like this aren't supported.
If this was just a straight one-to-one relationship, which gave you an accessor from the Entry entity to the LastName one, you could use the standard Python sort function to sort the list:
entries.sort(key=lambda e: e.last_name.last_name)
(note that this sorts the list in place but returns None, so don't try assigning from it).
However, this won't work, because what you've actually got here is a one-to-many relationship: there are potentially many LastNames for each Entry. The definition actually recognises this: the collection_name attribute, which defines the accessor from Entry to LastName, is called last_names, ie plural.
So what you're asking doesn't really make sense: which of the potentially many LastNames do you want to sort on? You can certainly do it the other way round - given a query of LastNames, sort by entry year - but given your current structure there's not really any way of doing it.
I must say though, although I don't know the rest of your models, I suspect you have actually got that relationship the wrong way round: the ReferenceProperty should probably live on Entry pointing to LastName rather than the other way round as it is now. Then it would simply be the sort call I gave above.
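For illustration, here is how that sort call behaves once each Entry has exactly one last name. Plain classes stand in for the datastore models, the sample names are made up, and the key is extended to a (year, last name) tuple to match the question's "sort by year, then by last name".

```python
class LastName:
    def __init__(self, last_name):
        self.last_name = last_name


class Entry:
    def __init__(self, year, last_name):
        self.year = year
        self.last_name = LastName(last_name)  # one-to-one accessor, as the answer assumes


entries = [Entry("2010", "Baker"), Entry("2009", "Adams"), Entry("2010", "Abbott")]

# list.sort() reorders the list in place and returns None.
result = entries.sort(key=lambda e: (e.year, e.last_name.last_name))
names = [e.last_name.last_name for e in entries]
```

This only orders what is already in memory, so with offset-based pagination it sorts each page rather than the whole result set, as the question's edit points out.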
