How badly should I avoid surrogate primary keys in SQL? - python

Short story
I have a technical problem with a third-party library at my hands that I seem to be unable to easily solve in a way other than creating a surrogate key (despite the fact that I'll never need it). I've read a number of articles on the Net discouraging the use of surrogate keys, and I'm a bit at a loss if it is okay to do what I intend to do.
Long story
I need to specify a primary key, because I use SQLAlchemy ORM (which requires one), and I cannot just set it in __mapper_args__, since the class is being built with classobj, and I have yet to find a way to reference the field of a not-yet-existing class in the appropriate PK definition argument. Another problem is that the natural equivalent of the PK is a composite key that is too long for the version of MySQL I use (and it's generally a bad idea to use such long primary keys anyway).

I always make surrogate keys when using ORMs (or rather, I let the ORMs make them for me). They solve a number of problems, and don't introduce any (major) problems.
So, you've done your job by acknowledging that there are "papers on the net" with valid reasons to avoid surrogate keys, and that there's probably a better way to do it.
Now, write "# TODO: find a way to avoid surrogate keys" somewhere in your source code and go get some work done.

"Using a surrogate key allows duplicates to be created when using a natural key would have prevented such problems" Exactly, so you should have both keys, not just a surrogate. The error you seem to be making is not that you are using a surrogate, it's that you are assuming the table only needs one key. Make sure you create all the keys you need to ensure the integrity of your data.
Having said that, in this case it seems like a deficiency of the ORM software (apparently not being able to use a composite key) is the real cause of your problems. It's unfortunate that a software limitation like that should force you to create keys you don't otherwise need. Maybe you could consider using different software.

I use surrogate keys in a db that I use reflection on with sqlalchemy. The pro is that you can more easily manage the foreign keys / relationships that exists in your tables / models. Also, the rdbms is managing the data more efficiently. The con is the data inconsistency: duplicates. To avoid this - always use the unique constraint on your natural key.
Now, I understand from your long story that you can't enforce this uniqueness because of your mysql limitations. For long composite keys mysql causes problems. I suggest you move to postgresql.

Related

Why Django doesn't have an on_update=models.CASCADE option?

My question comes from a situation where I want to emulate the ON UPDATE CASCADE in SQL (when I update an id and I have a Foreignkey, it is going to be automatically updated) in Django, but I realized that (apparently) doesn't exist a native way to do it.
I have visited this old question where he is trying to do the same.
My question is: Why Django doesn't have a native option? Is that a bad practice? If so, why? Does that bring problems to my software or structure?
Is there a simple way to do this?
Thanks in advance!
It is a bad practice to think about updating key fields. Note that this field is used for index generation and other computationally expensive operations. This is why the key fields must contain a unique and unrepeatable value that identifies the record. It does not necessarily have to have a direct meaning with the content of the register, it can be an autonumeric value. If you need to update it, you may need to put the value in another field of the table to be able to do it without affecting the relationships.
It is for this reason that you will not find in Django the operation you are looking for.

Google Cloud NDB integer vs urlsafe IDs?

The google-cloud-ndb Python library provides two ways of generating identifiers for Datastore entities:
integer_id(): Returns the ID as an integer.
urlsafe(): Returns a base64 string of the key.
If I am creating a URL mapping to a specific entity (eg: /users/<user_id>/) can I use either of these ID options?
I assume there is some benefit to using the base64 encoded version for URLs? The only issue is it results in some pretty ugly URLs, so I prefer to use the integer for aesthetics.
Is there a technical benefit (like improved performance) to using either option?
the urlsafe keys are useful if you specify your own custom ids when creating entities, since your custom id may include characters that cannot go in a url.
Also the urlsafe key has the kind and project id baked into it, which can be handy in case you get some wires crossed and pass the wrong id to the wrong spot.
I wouldn't say there is a performance benefit.
Another note about the urlsafe keys is that the format did change recently. The ndb library is backwards compatible, so in general it should've been fine, but that's something that they could possibly do again in the future, so just be aware of that.

Django - short non-linear non-predictable ID in the URL

I know there are similar questions (like this, this, this and this) but I have specific requirements and looking for a less-expensive way to do the following (on Django 1.10.2):
Looking to not have sequential/guessable integer ids in the URLs and ideally meet the following requirements:
Avoid UUIDs since that makes the URL really long.
Avoid a custom primary key. It doesn’t seem to work well if the models have ManyToManyFields. Got affected by at least three bugs while trying that (#25012, #24030 and #22997), including messing up the migrations and having to delete the entire db and recreating the migrations (well, lots of good learning too)
Avoid checking for collisions if possible (hence avoid a db lookup for every insert)
Don’t just want to look up by the slug since it’s less performant than just looking up an integer id.
Don’t care too much about encrypting the id - just don’t want it to
be a visibly sequential integer.
Note: The app would likely have 5 million records or so in the long term.
After researching a lot of options on SO, blogs etc., I ended up doing the following:
Encoding the id to base32 only for the URLs and decoding it back in urls.py (using an edited version of Django’s util functions to encode to base 36 since I needed uppercase letters instead of lowercase).
Not storing the encoded id anywhere. Just encoding and decoding everytime on the fly.
Keeping the default id intact and using it as primary key.
(good hints, posts and especially this comment helped a lot)
What this solution helps achieve:
Absolutely no edits to models or post_save signals.
No collision checks needed. Avoiding one extra request to the db.
Lookup still happens on the default id which is fast. Also, no double save()requests on the model for every insert.
Short and sweet encoded ID (the number of characters go up as the number of records increase but still not very long)
What it doesn’t help achieve/any drawbacks:
Encryption - the ID is encoded but not encrypted, so the user may
still be able to figure out the pattern to get to the id (but I dont
care about it much, as mentioned above).
A tiny overhead of encoding and decoding on each URL construction/request but perhaps that’s better than collision checks and/or multiple save() calls on the model object for insertions.
For reference, looks like there are multiple ways to generate random IDs that I discovered along the way (like Django’s get_random_string, Python’s random, Django’s UUIDField etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not).
The encoded ID can also be stored as another (indexed) field and looked up every time (like here) but depends on the performance parameters of the web app (since looking up a varchar id is less performant that looking up an integer id). This identifier field can either be saved from a overwritten model’s save() function, or by using a post_save() signal (see here) (while both approaches will need the save() function to be called twice for every insert).
All ears to optimizations to the above approach. I love SO and the community. Everytime there’s so much to learn here.
Update: After more than a year of this post, I found this great library called hashids which does pretty much the same thing quite well! Its available in many languages including Python.

Is using strings as an object identifier bad practice?

I am developing a small app for managing my favourite recipes. I have two classes - Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation, etc). The reason i have an Ingredient class is, that i want to save some additional info in it (proper technique, etc). Ingredients are unique, so there can not be two with the same name.
Currently i am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as i can ask my model, if an ingredient is already registered and use it (including all it's other data) for a newly created recipe.
But thinking back to when i started programming (Java/C++), i always read, that using strings as an identifier is bad practice. "The Magic String" was a keyword that i often read (But i think that describes another problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python3 uses UTF-8 everywhere if i am not mistaken), but i am not sure if what i am doing is the right way to do it.
Is using strings as an object identifier bad practice? Are there differences between different languages? Can strings prove to be an performance issue, if the amount of data increases? What are the alternatives?
No -
actually identifiers in Python are always strings. Whether you keep then in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmaticaly, with a name hard-coded into the source code. In this later case, Python creates the name in one of its automaticaly handled internal dictionary (that can be inspected as the return of globals() or locals()).
Moreover, Python does not use "utf-8" internally, it does use "unicode" - which means it is simply text, and you should not worry how that text is represented in actual bytes.
Python relies on dictionaries for many of its core features. For that reason the pythonic default dict already comes with a quite effective, fast implementation "from factory", decent hash, etc.
Considering that, the dictionary performance itself should not be a concern for what you need (eventual calls to read and write on it), although the way you handle it / store it (in a python file, json, pickle, gzip, etc.) could impact load/access time, etc.
Maybe if you provide a few lines of code showing us how you deal with the dictionary we could provide specific details.
About the string identifier, check jsbueno's answer, he gave a much better explanation then I could do.

Are there reasons to use get/put methods instead of item access?

I find that I have recently been implementing Mapping interfaces on classes which on the surface fit the model (they are essentially just key-value stores with no more meta-data), but underneath they are sometimes quite complex.
Here are a couple examples of increasing severity:
An object which wraps another mapping converting all objects to strings when set.
An object which uses a local database as a back-end to store the key-value pairs.
An object which makes HTTP requests to remote servers to get/set data.
Lets suppose all of these examples seamlessly implement the Mapping interface, and the only indication that there is something fishy going on is that item access potentially could take a few seconds, and an item may not be retrievable in the same form as stored (if at all). I'm perfectly content with something like the first example, pretty okay with the second, but I'm getting kinda uncomfortable with the last.
The question is, is there a line at which the API for these models should not use item access, even though the underlying structure may feel like it fits on the surface?
It sounds like you are describing the standard anydbm module semantics. And just as anydbm can raise exception anydbm.error, so too could your subclass raise derivatives like MyDbmTimeoutError as needed. Whether you implement it as dictionary operations or function calls, the caller will still have to contend with exceptions anyway (e.g. KeyError, NameError).
I think that arbitrary "tied" hashes exist in Python 2 and 3.x is decent justification for saying it is a reasonable approach. Indeed, I've been looking for (and designing in my head) more complex bindings than simple key ⇒ value mappings without a heavy ORM SQL layer in between.
added: The more I think about it, the more Pythonic tied dictionaries seem to be. A key ⇒ value collection is a dictionary. Whether it lives in core or on disk or across the network is an implementation detail that is best abstracted away. The only substantive differences are increased latency and possible unavailability; however, on a virtual memory based OS, "core" can have higher latency than RAM and in a multiprocessing OS, "core" can become unavailable too. So these are differences in degree only, not kind.
From a strictly philosophical point of view, I don't think that there is a line you can cross with this. If some tool provides the needed functionality, but its API is different, adapt away. The only time you shouldn't do this is if the adapted to API simply is not expressive enough to manipulate the adapted component in a way that is needed.
I wouldn't hesitate to adapt a database into a dict, because that's a great way to manipulate collections, and it's already compatible with a heck of a lot of other code. If I find that my particular application must make calls to the database connections begin(), commit(), and rollback() methods to work right, then a dict won't do, since dict's don't have transaction semantics.

Categories