Google Cloud NDB integer vs urlsafe IDs? - python

The google-cloud-ndb Python library provides two ways of generating identifiers for Datastore entities:
integer_id(): Returns the ID as an integer.
urlsafe(): Returns a URL-safe base64 encoding of the entire key.
If I am creating a URL mapping to a specific entity (e.g. /users/<user_id>/), can I use either of these ID options?
I assume there is some benefit to using the base64-encoded version in URLs? The only issue is that it results in some pretty ugly URLs, so I'd prefer the integer for aesthetics.
Is there a technical benefit (like improved performance) to using either option?

The urlsafe keys are useful if you specify your own custom IDs when creating entities, since a custom ID may include characters that cannot go in a URL.
Also, the urlsafe key has the kind and project ID baked into it, which can be handy in case you get some wires crossed and pass the wrong ID to the wrong spot.
I wouldn't say there is a performance benefit to either.
Another note about urlsafe keys: the format changed recently. The ndb library is backwards compatible, so in general the change was harmless, but it's something they could do again in the future, so just be aware of that.
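For concreteness, here is a minimal sketch of the two representations with google-cloud-ndb. The "User" kind and the ID 12345 are made up for illustration, and the client assumes application-default credentials:

```python
from google.cloud import ndb

client = ndb.Client()  # assumes application-default credentials

with client.context():
    # "User" is a hypothetical kind; 12345 stands in for an assigned id.
    key = ndb.Key("User", 12345)

    print(key.integer_id())  # 12345 -- short and readable in a URL
    print(key.urlsafe())     # b'...' -- base64 blob embedding project, kind and id

    # The urlsafe form round-trips back to a complete Key:
    assert ndb.Key(urlsafe=key.urlsafe()) == key
```

Note that the urlsafe string decodes back into a full Key on its own, whereas the integer ID only identifies an entity once you supply the kind yourself.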

Related

Django - short non-linear non-predictable ID in the URL

I know there are similar questions (like this, this, this and this), but I have specific requirements and am looking for a less expensive way to do the following (on Django 1.10.2):
I'm looking to avoid sequential/guessable integer IDs in the URLs, and ideally to meet the following requirements:
Avoid UUIDs since that makes the URL really long.
Avoid a custom primary key. It doesn't seem to work well if the models have ManyToManyFields. I was bitten by at least three bugs while trying that (#25012, #24030 and #22997), including messing up the migrations and having to delete the entire DB and recreate the migrations (well, lots of good learning too).
Avoid checking for collisions if possible (and hence a DB lookup on every insert).
Avoid looking up by a slug, since that's less performant than looking up an integer ID.
Don't care too much about encrypting the ID; I just don't want it to be a visibly sequential integer.
Note: The app would likely have 5 million records or so in the long term.
After researching a lot of options on SO, blogs etc., I ended up doing the following:
Encoding the ID to base 32 only for the URLs and decoding it back in urls.py (using an edited version of Django's base-36 util functions, since I needed uppercase letters instead of lowercase); see the sketch after this list.
Not storing the encoded ID anywhere. Just encoding and decoding on the fly every time.
Keeping the default id intact and using it as the primary key.
(good hints, posts and especially this comment helped a lot)
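Here is a minimal sketch of that encode/decode-on-the-fly idea. The base32_encode/base32_decode helpers are illustrative stand-ins for my edited versions of Django's base-36 utilities (Django itself only ships int_to_base36/base36_to_int in django.utils.http):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # uppercase base-32 digits

def base32_encode(n: int) -> str:
    """Encode a non-negative integer id for use in a URL."""
    if n < 0:
        raise ValueError("negative ids are not supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 32)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def base32_decode(encoded: str) -> int:
    """Decode the URL segment back to the integer primary key."""
    n = 0
    for ch in encoded:
        n = n * 32 + ALPHABET.index(ch)  # raises ValueError on bad input
    return n

# In a view (or a custom path converter), look up by the decoded pk:
#   obj = get_object_or_404(MyModel, pk=base32_decode(encoded_id))
```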
What this solution helps achieve:
Absolutely no edits to models or post_save signals.
No collision checks needed, avoiding one extra request to the DB.
Lookup still happens on the default ID, which is fast. Also, no double save() calls on the model for every insert.
Short and sweet encoded ID (the number of characters goes up as the number of records increases, but it's still not very long).
What it doesn’t help achieve/any drawbacks:
Encryption: the ID is encoded but not encrypted, so the user may still be able to figure out the pattern and recover the ID (but I don't care much about that, as mentioned above).
A tiny overhead of encoding and decoding on each URL construction/request, but perhaps that's better than collision checks and/or multiple save() calls on the model object for insertions.
For reference, it looks like there are multiple ways to generate random IDs that I discovered along the way (like Django's get_random_string, Python's random, Django's UUIDField, etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not).
The encoded ID can also be stored as another (indexed) field and looked up every time (like here), but that depends on the performance parameters of the web app (since looking up a varchar ID is less performant than looking up an integer ID). This identifier field can be saved either from an overridden save() method on the model or by using a post_save() signal (see here); both approaches require save() to be called twice for every insert.
All ears for optimizations to the above approach. I love SO and the community; every time, there's so much to learn here.
Update: More than a year after this post, I found this great library called hashids, which does pretty much the same thing quite well! It's available in many languages, including Python.
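Illustrative usage of the Python package (the salt is made up, and the encoded string depends on it):

```python
from hashids import Hashids  # pip install hashids

hashids = Hashids(salt="my-secret-salt", min_length=8)

encoded = hashids.encode(12345)       # short, non-sequential-looking string
(decoded,) = hashids.decode(encoded)  # decode returns a tuple of ints
assert decoded == 12345
```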

Is using strings as an object identifier bad practice?

I am developing a small app for managing my favourite recipes. I have two classes, Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation, etc.). The reason I have an Ingredient class is that I want to save some additional info in it (proper technique, etc.). Ingredients are unique, so there cannot be two with the same name.
Currently I am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as I can ask my model whether an ingredient is already registered and use it (including all its other data) for a newly created recipe.
But thinking back to when I started programming (Java/C++), I always read that using strings as identifiers is bad practice. "Magic strings" was a phrase I often read (but I think that describes another problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python 3 uses UTF-8 everywhere, if I am not mistaken), but I am not sure if what I am doing is the right way to do it.
Is using strings as object identifiers bad practice? Are there differences between languages? Can strings become a performance issue as the amount of data increases? What are the alternatives?
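For concreteness, a minimal sketch of the registry pattern the question describes (class and field names are illustrative, not from the original app):

```python
class Ingredient:
    def __init__(self, name: str, technique: str = ""):
        self.name = name
        self.technique = technique

# The "big" dictionary: ingredient name -> Ingredient object.
ingredients: dict[str, Ingredient] = {}

def get_or_register(name: str) -> Ingredient:
    # The name is the unique identifier, so duplicate checks and
    # lookups are single dict operations.
    if name not in ingredients:
        ingredients[name] = Ingredient(name)
    return ingredients[name]

flour = get_or_register("flour")
assert get_or_register("flour") is flour  # same object, no duplicates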
No. Identifiers in Python are actually always strings, whether you keep them in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmatically, with a name hard-coded into the source code. In the latter case, Python creates the name in one of its automatically handled internal dictionaries (which can be inspected as the return value of globals() or locals()).
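For instance:

```python
# A module-level name is just a string key in the module's namespace dict:
x = 42
assert globals()["x"] == 42

def f():
    y = "local"
    # locals() exposes a function's namespace the same way:
    assert locals()["y"] == "local"

f()
```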
Moreover, Python does not use "utf-8" internally; it uses "unicode", which means it is simply text, and you should not worry about how that text is represented in actual bytes.
Python relies on dictionaries for many of its core features. For that reason, the built-in dict comes with a quite effective, fast implementation out of the box, a decent hash, etc.
Considering that, dictionary performance itself should not be a concern for what you need (occasional calls to read and write to it), although the way you handle and store it (a Python file, JSON, pickle, gzip, etc.) could impact load/access time.
Maybe if you provide a few lines of code showing us how you deal with the dictionary, we could provide specific details.
About the string identifier, check jsbueno's answer; he gave a much better explanation than I could.

map data type for python google app engine

I would like to have a map data type for one of my entity types in my Python Google App Engine application. I think what I need is essentially the Python dict datatype, where I can create a list of key-value mappings. I don't see any obvious way to do this with the provided datatypes in App Engine.
The reason I'd like to do this is that I have a User entity, and I'd like to track within that user a mapping of lesson IDs to values representing that user's status with each lesson. I'd like to do this without creating a whole new entity kind, say UserLessonStatus, that references the User and has to be queried, since I often want to iterate through all the lesson statuses. Maybe it is better done that way, in which case I'd appreciate opinions saying so. Otherwise, if someone knows a good way to create a mapping within my User entity itself, that'd be great.
One solution I considered is using two ListProperties in conjunction: when adding an object, append the key and value to each list; when locating, find the index of the string in one list and use that index in the other; when removing, find the index in one and use it to remove from each; and so forth.
You're probably better off using another kind, as you suggest. If you do want to store it all in the one entity, though, you have several options. Parallel lists, as you suggest, are one; you could also simply pickle a Python dictionary, assuming you don't want to query on it.
You may want to check out the ndb project, which supports nested entities, which would also be a viable solution.
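For example, a hedged sketch of keeping the mapping on the entity itself with ndb's JsonProperty (shown with the google-cloud-ndb package from the first question; the property names are illustrative, and the blob is easy to iterate but not queryable by individual lesson ID):

```python
from google.cloud import ndb

class User(ndb.Model):
    name = ndb.StringProperty()
    # The whole mapping is stored as one JSON blob on the entity.
    lesson_status = ndb.JsonProperty()

# Inside a client context:
#   user = User(name="alice", lesson_status={"lesson1": "done"})
#   user.lesson_status["lesson2"] = "in_progress"
#   user.put()
```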

How badly should I avoid surrogate primary keys in SQL?

Short story
I have a technical problem with a third-party library on my hands that I seem unable to solve in any way other than creating a surrogate key (despite the fact that I'll never need it). I've read a number of articles on the net discouraging the use of surrogate keys, and I'm a bit at a loss as to whether it is okay to do what I intend to do.
Long story
I need to specify a primary key because I use the SQLAlchemy ORM (which requires one), and I cannot just set it in __mapper_args__, since the class is being built with classobj and I have yet to find a way to reference a field of a not-yet-existing class in the appropriate PK definition argument. Another problem is that the natural equivalent of the PK is a composite key that is too long for the version of MySQL I use (and it's generally a bad idea to use such long primary keys anyway).
I always make surrogate keys when using ORMs (or rather, I let the ORMs make them for me). They solve a number of problems, and don't introduce any (major) problems.
So, you've done your job by acknowledging that there are "papers on the net" with valid reasons to avoid surrogate keys, and that there's probably a better way to do it.
Now, write "# TODO: find a way to avoid surrogate keys" somewhere in your source code and go get some work done.
"Using a surrogate key allows duplicates to be created when using a natural key would have prevented such problems" Exactly, so you should have both keys, not just a surrogate. The error you seem to be making is not that you are using a surrogate, it's that you are assuming the table only needs one key. Make sure you create all the keys you need to ensure the integrity of your data.
Having said that, in this case it seems like a deficiency of the ORM software (apparently not being able to use a composite key) is the real cause of your problems. It's unfortunate that a software limitation like that should force you to create keys you don't otherwise need. Maybe you could consider using different software.
I use surrogate keys in a DB that I use reflection on with SQLAlchemy. The pro is that you can more easily manage the foreign keys / relationships that exist in your tables / models; also, the RDBMS manages the data more efficiently. The con is the risk of data inconsistency: duplicates. To avoid this, always put a unique constraint on your natural key.
Now, I understand from your long story that you can't enforce this uniqueness because of your MySQL limitations: MySQL causes problems with long composite keys. I suggest you move to PostgreSQL.
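A sketch of the "surrogate key plus unique natural key" advice in SQLAlchemy (the table and column names are made up for illustration):

```python
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Reading(Base):
    __tablename__ = "readings"
    __table_args__ = (
        # The natural composite key keeps its integrity constraint...
        UniqueConstraint("station", "taken_at", name="uq_readings_natural"),
    )

    # ...while a short surrogate key satisfies the ORM and keeps
    # foreign keys compact.
    id = Column(Integer, primary_key=True)
    station = Column(String(64), nullable=False)
    taken_at = Column(String(32), nullable=False)
```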

Examples of use for PickledObjectField (django-picklefield)?

Surfing the web and reading about Django dev best practices, the advice is to use pickled model fields with extreme caution.
But in a real-life example, where would you use a PickledObjectField, and to solve what specific problems?
We have a system of social-network "backends" which do some generic stuff like "post message", "get status", "get friends", etc. The link between each backend class and a user is a Django model, which keeps the user, the backend name, and credentials. Now imagine how many auth systems there are: OAuth, plain passwords, Facebook's obscure JS stuff, etc. This is where JSONField shines: we keep all backend-specific auth data in a dictionary on this model, which is stored in the DB as JSON; we can put anything into it, no problem.
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON and some extra logic to create an object out of the JSON. At the end of the day, your use case, when considering a PickledObjectField versus JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object and know that it will always be serializable, you can reasonably use the PickledObjectField. In my case (I don't use Django's ORM, but this should still apply), I have a couple of different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to store the different objects, and then later retrieve them in perfectly usable form (calling their functions). Caveat: if you store an object via PickledObjectField, then change the object's definition, and then retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum or what have you, to your liking, and use PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
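A minimal sketch of that one-step store/retrieve pattern with django-picklefield (the model and field names are illustrative, not from my application):

```python
from django.db import models
from picklefield.fields import PickledObjectField

class Document(models.Model):
    title = models.CharField(max_length=200)
    # Any picklable Python object -- nested lists, dicts, custom
    # class instances -- goes in whole and comes back out intact.
    structure = PickledObjectField(null=True)

# doc.structure = {"aspects": [...], "levels": (1, 2, 3)}
# doc.save()
```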
Real life example
In my application there are RQdef objects that represent essentially a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always completely constructed in a single step before it is stored in the database and it is henceforth never modified, only read (but read frequently).
PickledObjectField is more convenient and much more efficient for this purpose than would be a set of four models and 20-40 objects for each RQdef.
