MultiSelectField vs separate model - python

I'm building a directory of Hospitals and Clinics, and in a speciality field I'd like to store the speciality or type of each clinic or hospital (Dermatology, etc.). However, some places, especially big ones, have many different specialities under one roof, and since the choices= option of a CharField doesn't allow me to select more than one option, I had to think of an alternative.
At first I didn't think it was necessary to create a different table and add a relation, which is why I tried the django-multiselectfield package, and it works just fine. But I was wondering if it would be better to create a different table and give it a relation to the Hospitals model. Once built, that 'type' or 'speciality' table likely won't ever change in its contents. Is it better to build a different table performance-wise?
Also, I'm trying to store the choices of the model in a separate choices.py file with TextChoices classes, as I will be using the same choices in various fields of different models across different apps. I know it's generally better to store the choices inside the same class, but does that make sense in my case?

Performance is probably not the primary concern here; I think the difference between the two approaches would be negligible. Whether one or more than one model would use the same set of choices doesn't lean one way or another; either a fixed list or many-to-many relation could accommodate that.
Although you say that the selections aren't expected to change (an argument in favor of a hard-coded list of choices), medical specialties are a kind of data that do change in the long run. Contrast this with, say, months of the year or days of the week, which are a lot less likely to change.
That said, if you already have a multi-select field working, I'd be inclined to leave it alone until there's a compelling reason to change it.

For the second part, I see no issue with storing the choices list in another .py file.
I've done that strictly to keep my models.py looking somewhat pretty - I don't want to scroll past 150 choices to double-check a model method.
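For illustration, a shared choices.py might look roughly like this (a minimal sketch assuming Django 3.0+ for TextChoices; the class and member names are made up, not from the question):

# choices.py - hypothetical names, reusable across models and apps
from django.db import models

class Specialty(models.TextChoices):
    DERMATOLOGY = "DERM", "Dermatology"
    CARDIOLOGY = "CARD", "Cardiology"
    PEDIATRICS = "PEDS", "Pediatrics"

# In any models.py:
#   from .choices import Specialty
#   speciality = models.CharField(max_length=4, choices=Specialty.choices)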
The first part is all about taste. I'd personally go the relation + ManyToMany route (see the sketch below).
I always plan for edge cases, so "likely won't change" = "so there's a possibility".
I also like that the relation + ManyToMany route doesn't add a dependency; it's a core Django feature, so it's pretty rock solid and future-proof.
An added benefit of making it another table is that a non-technical person could potentially add new options, so in theory you're not spending your time constantly changing it.
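As a rough sketch of that route (model and field names are illustrative, not from the question):

from django.db import models

class Specialty(models.Model):
    name = models.CharField(max_length=100, unique=True)

    def __str__(self):
        return self.name

class Hospital(models.Model):
    name = models.CharField(max_length=200)
    specialties = models.ManyToManyField(Specialty, related_name="hospitals")

New specialties then become rows that can be added through the admin, and queries like Hospital.objects.filter(specialties__name="Dermatology") work out of the box.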


Query across all schemas on identical table on Postgres

I'm using Postgres and I have multiple schemas with identical tables, which are dynamically added by the application code.
foo, bar, baz, abc, xyz, ...,
I want to be able to query all the schemas as if they are a single table
!!! I don't want to query all the schemas one by one and combine the results
I want to "combine" the tables across schemas (not sure if this would be considered a huge join) and then run the query.
For example, an order by query shouldn't be like
1. schema_A.result_1
2. schema_A.result_3
3. schema_B.result_2
4. schema_B.result_4
but instead it should be
1. schema_A.result_1
2. schema_B.result_2
3. schema_A.result_3
4. schema_B.result_4
If possible I don't want to generate a query that goes like
SELECT schema_A.table_X.field_1, schema_B.table_X.field_1 FROM schema_A.table_X, schema_B.table_X
But I want that to be taken care of in postgresql, in the database.
Generating a query with all the schemas (namespaces) appended can make my queries HUGE, with ~50 fields and ~50 schemas.
Since these tables are generated I also cannot inherit them from some global table and query that instead.
I'd also like to know if this simply isn't possible at a reasonable speed.
EXTRA:
I'm using django and django-tenants so I'd also accept any answer that actually helps me generate the entire query and run it to get a global queryset EVEN THOUGH it would be really slow.
Your question isn't as much of a question as it is an admission that you've got a really terrible database and application design. It's as if you partitioned something that didn't need to be partitioned, or partitioned it in the wrong way.
Since you're doing something awkward, the database itself won't provide you with any elegant solution. Instead, you'll have to get more and more awkward until the regret becomes too much to bear and you redesign your database and/or your application.
I urge you to repent now, the sooner the better.
After that giant caveat based on a haughty moral position, I acknowledge that the only reason we answer questions here is to get imaginary internet points. And so, my answer is this: use a view that unions all of the values together and presents them as if they came from one table. I can't make any sense of the "order by query", so I'll just ignore it for now. Maybe you mean that you want the results in a certain order; if so, you can add constants to each SELECT operand of each UNION ALL and ORDER BY that constant column coming out of the union. But if the order of the rows matters, I'd assert that you are showing yet another symptom of a poor database design.
You can programmatically update the view whenever you create or update schemas and their catalogs (a rough sketch of that follows the example below).
A working example is here: http://sqlfiddle.com/#!17/c09265/1
with this schema creation and population code:
CREATE SCHEMA Fooey;
CREATE SCHEMA Junk;
CREATE TABLE Fooey.Baz (SomeInteger INT);
CREATE TABLE Junk.Baz (SomeInteger INT);
INSERT INTO Fooey.Baz (SomeInteger) VALUES (17), (34), (51);
INSERT INTO Junk.Baz (SomeInteger) VALUES (13), (26), (39);
CREATE VIEW AllOfThem AS
SELECT 'FromFooey' AS SourceSchema, SomeInteger FROM Fooey.Baz
UNION ALL
SELECT 'FromJunk' AS SourceSchema, SomeInteger FROM Junk.Baz;
and this query:
SELECT *
FROM AllOfThem
ORDER BY SourceSchema;
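To automate the view maintenance mentioned above from Django, something along these lines could work (a hedged sketch: the schema list, table name, and view name are assumptions, and the identifiers must come from a trusted source since they are interpolated into SQL):

from django.db import connection

def rebuild_global_view(schema_names, table="table_x", view="public.all_table_x"):
    # Build one SELECT per tenant schema and glue them together with UNION ALL.
    selects = [
        "SELECT '{s}' AS source_schema, * FROM {s}.{t}".format(s=s, t=table)
        for s in schema_names
    ]
    sql = "CREATE VIEW {v} AS\n".format(v=view) + "\nUNION ALL\n".join(selects)
    with connection.cursor() as cursor:
        cursor.execute("DROP VIEW IF EXISTS {v}".format(v=view))
        cursor.execute(sql)

Call it whenever a tenant schema is created or dropped; queries against the view then behave like queries against a single table, ORDER BY included.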
Why are per-tenant schemas a bad design?
This design favors laziness over scalability. If you don't want to make changes to your application, you can simply point connections at a particular schema and keep working without any code changes. Adding more tenants means adding more schemas, which it sounds like you've automated. Adding many schemas will eventually make database management cumbersome (what if you have thousands or millions of tenants?), and even with only a few, the dynamic nature of the list and the difficulty of writing system-wide queries are issues you've already discovered.
Consider instead combining everything and adding the tenant ID as part of a key on each table. In that case, adding more tenants means adding more rows. Any summary queries trivially come from single tables, and all of the features and power of the database implementation and its query language are at your fingertips without any fuss whatsoever.
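A minimal sketch of that shape (the model names are invented for illustration):

from django.db import models

class Tenant(models.Model):
    name = models.CharField(max_length=100)

class Order(models.Model):
    tenant = models.ForeignKey(Tenant, on_delete=models.CASCADE)
    placed_at = models.DateTimeField()

    class Meta:
        indexes = [models.Index(fields=["tenant", "placed_at"])]

# Cross-tenant queries are then ordinary single-table queries:
#   Order.objects.order_by("placed_at")
#   Order.objects.filter(tenant=some_tenant)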
It's simply false that a database design can't be changed, even in an existing and busy system. It takes a lot of effort to do it, but it can be done and people do it all the time. That's why getting the database design right as early as possible is important.
The README of the django-tenants package you're using describes their decision to trade off towards laziness, and cites a whitepaper that outlines many of the shortcomings of that method and its alternatives.

Is it worth using `select_related()` when fetching just one instance of a model instead of a queryset?

I'll keep it short. Say we have this database structure:
class Bird(models.Model):
    name = models.CharField(max_length=100)
    specie = models.CharField(max_length=100)

class Feather(models.Model):
    bird = models.ForeignKey(Bird, on_delete=models.CASCADE)
And then we have some simple lines from an APIView:
feather_id = request.query_params['feather']
feather = Feather.objects.get(pk=feather_id)
bird_name = feather.bird.name
# a lot of lines between
bird_specie = feather.bird.specie
Does it make any difference using:
feather = Feather.objects.select_related('bird').get(pk=1)
instead of:
feather = Feather.objects.get(pk=1)
in this scenario? I saw some people using select_related() in this way and I wonder if it makes any difference; if not, which one should you use? Normally I agree that select_related() is useful for optimization when using querysets, to avoid querying each instance individually, but in this case, is it worth using it when you have just one instance in the whole APIView?
In my opinion the only difference is that another query is just made later on, but when talking about performance, it's the same.
Thanks in advance.
Yes, it will make a difference, but it will probably be a very small difference. Selecting the related tables at the same time eliminates the time required for additional round trips to the database, likely a few milliseconds at most.
This may only matter to you if you have higher latency connecting to the database, there are many related tables that will be fetched in turn, and/or the apiview has very high load (and every millisecond counts).
I generally use select_related() on single object queries, but as a stylistic choice rather than a performance choice: to indicate which other models are going to be fetched and used (explicit is better than implicit).
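To see the difference concretely, here's a small sketch using the models above (connection.queries is only populated when DEBUG=True):

from django.db import connection, reset_queries

reset_queries()
feather = Feather.objects.select_related("bird").get(pk=1)
name = feather.bird.name            # bird came back in the same JOINed query
print(len(connection.queries))      # 1

reset_queries()
feather = Feather.objects.get(pk=1)
name = feather.bird.name            # lazily loads the bird -> a second query
print(len(connection.queries))      # 2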

classification of user population by access rights

What I am trying to do is to classify the employees by the roles they have in an organization. This is computed by grabbing all the permissions, or access lists, they have for the target enterprise software.
There are potentially 10000s of users and dozens of permissions per user.
Edit: when there are large numbers of users, the vast majority will have a limited set of permissions. For example, they might all have Employee only. The more complicated cases are power users, and there will be far fewer of those.
Also, don't be misled by the permission names I have given, like Acct1/Acct2; they're just meant to give a feel for the domain. The solution I am looking for should conceptually work even with randomly-assigned primary key integers like you see in many ORM stores - there is no implied relationship between permissions.
import pprint

pp = pprint.PrettyPrinter(indent=4)

def classify(employees):
    """Employees assigned the same set
    of permissions are grouped together."""
    roles = dict()
    for user, permissions in employees.items():
        permissions = list(permissions)
        permissions.sort()
        key = tuple(permissions)
        members = roles.setdefault(key, set())
        members.add(user)
    return roles
everyone = {
    "Jim": set(["Employee", "Acct1", "Manager"]),
    "Marion": set(["Employee", "Acct1", "Acct2"]),
    "Omar": set(["Employee", "Acct1"]),
    "Kim": set(["Employee", "Acct1"]),
    "Tyler": set(["Employee", "Acct1"]),
    "Susan": set(["Employee", "Marketing", "Manager"]),
}
result = classify(everyone)
print("pass1")
pp.pprint(result)
At this point, the classification system returns the following:
{ ('Acct1', 'Acct2', 'Employee'): set(['Marion']),
('Acct1', 'Employee'): set(['Kim', 'Omar', 'Tyler']),
('Acct1', 'Employee', 'Manager'): set(['Jim']),
('Employee', 'Manager', 'Marketing'): set(['Susan'])}
From this, we can eyeball the data and manually assign some meaningful names to those roles.
Senior Accountants - Marion
Accounting Managers - Jim
Accountants - Kim, Omar, Tyler
Marketing Manager - Susan
The assignment is manual, but the intent is that it remains as "sticky" as possible even when people get hired or leave and when permissions change.
Let's do a second pass.
Someone's decided to rename Acct2 to SrAcct. People get hired, Kim leaves.
This is represented by the following employee permissions:
everyone2 = {
    "Jim": set(["Employee", "Acct1", "Manager"]),
    "Marion": set(["Employee", "Acct1", "SrAcct"]),
    "Omar": set(["Employee", "Acct1"]),
    "Tyler": set(["Employee", "Acct1"]),
    "Milton": set(["Employee", "JuniorAcct"]),
    "Susan": set(["Employee", "Marketing", "Manager"]),
    "Tim": set(["Employee", "Marketing"]),
}
The output this time is:
{ ('Acct1', 'Employee'): set(['Omar', 'Tyler']),
('Acct1', 'Employee', 'Manager'): set(['Jim']),
('Acct1', 'Employee', 'SrAcct'): set(['Marion']),
('Employee', 'JuniorAcct'): set(['Milton']),
('Employee', 'Manager', 'Marketing'): set(['Susan']),
('Employee', 'Marketing'): set(['Tim'])}
Ideally, we'd recognize that
Senior Accountants - Marion
Accounting Managers - Jim
Accountants - Omar, Tyler
Marketing Manager - Susan
new role - Tim
new role - Milton
Tim's role will now be named Marketer, while Milton's will be Junior Accountant.
What's important is that the role name assignment is stable enough to allow reasoning about an employee population even as people get hired and leave (most frequent) and as permissions are added or renamed (much less frequent). It's OK to ask the end user from time to time to assign new role names or to decide between ties. But most of the time, it should run along smoothly. What it shouldn't do is guess wrong and erroneously label a set of users with the wrong role name.
The problem I have is that it is easy to eyeball, but both the set of permissions and the set of users that define a role can change. Classification time is important, but the value of this classification mechanism goes up as the number of users and permissions increase.
I've tried extracting "the subset of permissions that define a role". For example, Employee is assigned to everyone so it can be ignored, while (Manager, Acct1) and (Manager, Marketing) uniquely belong to Jim and Susan. The trouble is that this runs into a combinatorial explosion once you get the easy 20-30% of the cases out, and it never finishes.
What I'm thinking now is to compute the new employee-permission role classification for each generation and then backtrack to get a fuzzy-matched "best fit" against the previous generation. Pick the ones that are reasonably unambiguous and ask the user to decide on ties and to assign new role names as needed.
For example, an exact match on permissions and a reasonable match on employees means that 'Omar', 'Tyler' are still Accountants at pass 2. On the other hand, if Marion had left and I had "Jane": set(["Employee","Acct1","SrAcct"]), I'd have to ask the end user to arbitrate and identify her as a Senior Accountant.
I've worked with Jaccard Similarity (https://en.wikipedia.org/wiki/Jaccard_index) in the past, but I am unsure how it applies to cases where both sides can change (Acct2 => SrAcct as well as employee changes).
I am pretty sure this kind of logic has been needed before, so I'm hoping for recommendations for algorithms to look at and strategies to follow.
Oh, and I am looking for reasonably stand-alone approaches that I can implement, and reason about, within the context of a larger Python app. Not for machine-learning recommendations about how to configure the likes of TensorFlow to do this for me. Though, if push came to shove, I could call a batch to do the matching.
This will be a so-so answer, so apologies, but your problem is very wide and requires some logic rather than some specific code.
Perhaps this problem will be better addressed as "tags"? I mean a person could be both an employee, a guy in marketing, and a manager, all at the same time (and I presume will have permissions of all 3).
So I suggest a different approach - instead of grouping accounts by their respective permissions, and only then naming them manually, first classify and name the permissions (at least the more popular and stable among them) and then assign each employee to the correct category (or several) by giving each employee tags that encapsulate multiple permissions each.
Then, you will have quite a few users or permissions unclassified, but hopefully then you can ask users to do a bit of classification for you (for example, describe their position/permissions) and work with your approach on a much smaller problem set.
That way you can be sure that when a new employee enters, he is given the proper tag by looking at his permissions and deciding where he fits in. And when an employee leaves, it makes no difference, because he doesn't individually affect the permissions and tags.
What you're really creating here is a single tree of organizational hierarchy. Your grouping algorithm is already capable of that. You're not showing them within a single hierarchy, but they could easily be displayed that way.
The "subjective" part of your organization is deciding when it is appropriate to combine branches into a single organizational role, and deciding in which order to sort the permissions when creating the branches (i.e. do you want to have a single manager branch, with divisions below that, or do you want to have department branches, each containing a manager branch).
Unfortunately, there's no way for a machine to know those preferences. You're going to have to make all those decisions, especially if you're going to require a 0% false positive rate.
The easiest way I can think of to provide this preference information to the algorithm would be to give it an ordered list of permission "weights" it will use when building the hierarchy. For a first pass, you could just order them by how many people have that permission. It's possible that you might need more complex weighting than a single set of ordered permissions. For a more complex weighting, you would need to specify more complex "rules" that check membership (or non-membership) in multiple permission sets.
The second bit of information would likely be provided interactively. Given a display of the entire organizational chart, you would choose which permission sets should be combined into a single organizational set. This is where you would also assign display names for your roles to each permission set group(s).
As far as being able to respond to hires/fires, it shouldn't be a problem so long as the permissions are the same. As far as adding and removing permissions from users, you would have to store previous permissions and groupings and match them against current permissions for each user to prompt someone to either okay the change to the role permission set, or to form a new branch with the new permission.
This is what I ended up doing:
Before calculating the classification for a new set of user/access data, save the old classifications, along with their assigned names.
After the new classifications are calculated, find the closest match between the new and the old and transfer the names if the confidence is high enough.
Full user match? Then it's a match. I transform the user set into a sorted tuple of users to match via a dictionary.
Full permissions match? Again, it's a match. Again, check via a set-to-sorted-tuple transform looked up against a dictionary.
For each current classification left unmatched, I calculate a Jaccard similarity against each unmatched previous classification, separately on its users and its permissions. So that could go O(N^2) on the number of unmatched items. Append each match to that classification's list, sort the list in order of score (from the calc function below), and, as a last step, only pick one automatically if there is a large enough difference from the next closest match.
class Match(object):
    # These are weighing coefficients - I consider roles/permissions more
    # important because of the expected user churn.
    con_roles = .7
    con_users = .3
    con_other = .07
    threshold = .7

    def calc(self):
        # Could have anything you want here, really.
        self.similarity = self.con_roles * self.simroles + self.con_users * self.simusers
OK, I am leaving a lot out, but basically, you can apply a simple Jaccard similarity algorithm to both the user and role side and put those numbers into a suitable equation to see what counts as a close match. If not satisfied, ask the user to assign the names again, as a last resort.
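For anyone following along, a minimal sketch of that scoring step (the weights mirror the Match class above; the sample sets are made up):

def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Comparing an old classification to a new one:
old_users, new_users = {"Omar", "Tyler", "Kim"}, {"Omar", "Tyler"}
old_perms, new_perms = {"Employee", "Acct1"}, {"Employee", "Acct1"}
sim_users = jaccard(old_users, new_users)    # ~0.67
sim_roles = jaccard(old_perms, new_perms)    # 1.0
score = 0.7 * sim_roles + 0.3 * sim_users    # compare against the threshold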
Hopefully that'll help someone if they end up looking for something similar.

Django - short non-linear non-predictable ID in the URL

I know there are similar questions (like this, this, this and this) but I have specific requirements and looking for a less-expensive way to do the following (on Django 1.10.2):
Looking to not have sequential/guessable integer ids in the URLs and ideally meet the following requirements:
Avoid UUIDs since that makes the URL really long.
Avoid a custom primary key. It doesn’t seem to work well if the models have ManyToManyFields. Got affected by at least three bugs while trying that (#25012, #24030 and #22997), including messing up the migrations and having to delete the entire db and recreating the migrations (well, lots of good learning too)
Avoid checking for collisions if possible (hence avoid a db lookup for every insert)
Don’t just want to look up by the slug since it’s less performant than just looking up an integer id.
Don’t care too much about encrypting the id - just don’t want it to be a visibly sequential integer.
Note: The app would likely have 5 million records or so in the long term.
After researching a lot of options on SO, blogs etc., I ended up doing the following:
Encoding the id to base 32 only for the URLs and decoding it back in urls.py (using an edited version of Django’s base-36 util functions, since I needed uppercase letters instead of lowercase).
Not storing the encoded id anywhere. Just encoding and decoding every time, on the fly.
Keeping the default id intact and using it as primary key.
(good hints, posts and especially this comment helped a lot)
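For illustration, the encode/decode part might look roughly like this (a hedged sketch, not the author's exact code; the alphabet is one possible uppercase base-32 set):

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def encode_id(n):
    """Encode a positive integer id to an uppercase base-32-style string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 32)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode_id(s):
    """Decode the string back to the original integer id."""
    n = 0
    for ch in s:
        n = n * 32 + ALPHABET.index(ch)
    return n

# URLs are built with encode_id(obj.pk); the view (or urls.py) calls
# decode_id() and then does a normal integer primary-key lookup.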
What this solution helps achieve:
Absolutely no edits to models or post_save signals.
No collision checks needed. Avoiding one extra request to the db.
Lookup still happens on the default id, which is fast. Also, no double save() calls on the model for every insert.
Short and sweet encoded ID (the number of characters goes up as the number of records increases, but it's still not very long).
What it doesn’t help achieve/any drawbacks:
Encryption - the ID is encoded but not encrypted, so the user may still be able to figure out the pattern to get to the id (but I don’t care about that much, as mentioned above).
A tiny overhead of encoding and decoding on each URL construction/request but perhaps that’s better than collision checks and/or multiple save() calls on the model object for insertions.
For reference, looks like there are multiple ways to generate random IDs that I discovered along the way (like Django’s get_random_string, Python’s random, Django’s UUIDField etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not).
The encoded ID can also be stored as another (indexed) field and looked up every time (like here), but that depends on the performance parameters of the web app (since looking up a varchar id is less performant than looking up an integer id). This identifier field can either be saved from an overridden model’s save() function, or by using a post_save() signal (see here) (both approaches need the save() function to be called twice for every insert).
All ears to optimizations to the above approach. I love SO and the community. Every time, there’s so much to learn here.
Update: More than a year after this post, I found this great library called hashids, which does pretty much the same thing quite well! It’s available in many languages, including Python.
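A quick example of hashids in Python (the salt here is obviously a placeholder):

from hashids import Hashids   # pip install hashids

hashids = Hashids(salt="replace-with-your-own-salt", min_length=6)
encoded = hashids.encode(12345)            # a short, non-sequential-looking string
(original_id,) = hashids.decode(encoded)   # (12345,)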

Django model with hundreds of fields

I have a model with hundreds of properties. The properties can be of different types (integer, strings, uploaded files, ...). I would like to implement this complex model step by step, starting with the most important properties. I can think of two options:
Define the properties as regular model fields
Define a separate model to hold each property separately, and link it to the main model with a ForeignKey
I have not found any suggestions on how to handle models with lots of properties in Django. What are the advantages / drawbacks of both approaches?
You definitely should not define your properties as ForeignKeys. Every time you need a full model, your database server will have to make hundreds of JOINs, therefore ruining your performance.
If your properties are needed almost every time you access the model, you should keep them in the same model. If not, you could make a separate Properties model and link it to your original model via OneToOneField.
I personally had such an experience. We had to build a hotel recommendation engine, and we were using Drupal back then. As Drupal stores every custom property in a separate MySQL table, we quickly realised we should switch frameworks, because every single query crashed our production servers (20+ JOINs are a deadly thing for MySQL). BTW, we ended up using a custom solution based on ElasticSearch, which handles hundreds of fields just fine.
Update: If you're lucky enough to be using a recent version of PostgreSQL, you could leverage the JSONField storage to pack all your fields to a single model field. Note, though, that you'll have to implement a validation scheme by yourself.
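A rough sketch of that idea (model and field names are invented; models.JSONField requires Django 3.1+, older versions use django.contrib.postgres.fields.JSONField):

from django.db import models

class Device(models.Model):
    name = models.CharField(max_length=200)
    # The long tail of rarely-queried properties packed into one column.
    properties = models.JSONField(default=dict, blank=True)

# Querying into the JSON still works on PostgreSQL:
#   Device.objects.filter(properties__color="red")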
Regarding this being a "customer requirement": first off, I feel your pain and wish you the best! If that weren't the case, I'd reiterate that you should first be looking to change this, as there should never be any need for hundreds of properties on a single object; it normally points to a need for an array, inheritance, separate classes, etc.
Going forward, you're going to need to make heavy use of values and values_list to return only the properties that you actually need from the database, since performance will otherwise be severely crippled (there's a short illustration after this answer).
Since you can't do anything with the model, you should try to address your performance issues from the design side of things. The single responsibility principle should feature heavily in your website which will mean you'll only ever have a few values needed to be returned from the model. This way it really won't make much difference what option you choose since what is returned will be very limited.
Filter where you can, and use ordering sparingly.
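As a small illustration of the point about values and values_list (the model and field names are placeholders):

# Only pull the columns the view actually needs, not all hundreds of fields.
rows = MyModel.objects.values("id", "name", "status")      # list of dicts
names = MyModel.objects.values_list("name", flat=True)     # flat list of values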
You could group them into a few separate models, linked by OneToOneFields to the main model. That would "namespace" your data, and namespaces are "one honking great idea".
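A minimal sketch of that grouping idea (all names hypothetical):

from django.db import models

class Building(models.Model):
    name = models.CharField(max_length=200)

class BuildingDimensions(models.Model):
    building = models.OneToOneField(Building, on_delete=models.CASCADE,
                                    related_name="dimensions")
    height_m = models.FloatField(null=True, blank=True)
    floor_area_m2 = models.FloatField(null=True, blank=True)

class BuildingOwnership(models.Model):
    building = models.OneToOneField(Building, on_delete=models.CASCADE,
                                    related_name="ownership")
    owner_name = models.CharField(max_length=200, blank=True)

# Access stays readable, e.g. building.dimensions.height_m, and you can
# select_related("dimensions") only when that group of fields is needed.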
