May I use bson.objectid.ObjectId as (primary key) id in sql? - python

I am thinking about not using an auto-increment id as the primary key in MySQL and generating the key another way instead: can I replace the auto-increment id with a bson.objectid.ObjectId in MySQL?
According to ObjectId description, it's composed of:
a 4-byte value representing the seconds since the Unix epoch
a 3-byte machine identifier
a 2-byte process id
a 3-byte counter, starting with a random value.
It seems it can provide a unique, non-duplicate key. Is this a good idea?

You certainly could do this. One issue though is that since this can't be set by the database itself, you'll need to write some Python code to ensure it is set on save.
Since you're not using MongoDB, though, I wonder why you want to use a BSON id. Instead you might want to consider using UUID, which can indeed be set automatically by the db.
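If you do go the ObjectId route, here is a minimal sketch of how it could look with SQLAlchemy (the model and column names are hypothetical; the id is stored as the 24-character hex string and generated in Python, since MySQL can't produce ObjectIds itself):
from bson.objectid import ObjectId
from sqlalchemy import Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = 'items'
    # Generated client-side on every insert; the database never sees
    # the generation logic, so the default must live in Python.
    id = Column(String(24), primary_key=True, default=lambda: str(ObjectId()))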

Related

Google Cloud NDB integer vs urlsafe IDs?

The google-cloud-ndb Python library provides two ways of generating identifiers for Datastore entities:
integer_id(): Returns the ID as an integer.
urlsafe(): Returns a base64 string of the key.
If I am creating a URL mapping to a specific entity (eg: /users/<user_id>/) can I use either of these ID options?
I assume there is some benefit to using the base64 encoded version for URLs? The only issue is it results in some pretty ugly URLs, so I prefer to use the integer for aesthetics.
Is there a technical benefit (like improved performance) to using either option?
The urlsafe keys are useful if you specify your own custom ids when creating entities, since your custom id may include characters that cannot go in a URL.
Also the urlsafe key has the kind and project id baked into it, which can be handy in case you get some wires crossed and pass the wrong id to the wrong spot.
I wouldn't say there is a performance benefit.
Another note about the urlsafe keys: the format did change recently. The ndb library is backwards compatible, so in general it should be fine, but that's something they could do again in the future, so just be aware of it.
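For illustration, a minimal sketch of the two options (the kind and id here are hypothetical):
from google.cloud import ndb

client = ndb.Client()
with client.context():
    key = ndb.Key('User', 12345)
    user_id = key.id()                 # plain integer, e.g. for /users/12345/
    token = key.urlsafe().decode()     # base64 string; embeds project and kind
    same_key = ndb.Key(urlsafe=token)  # round-trips back to the full key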

Postgres/SQLAlchemy/Alembic change datatype from enum to int and map value with using

I've been using a setup with SQLAlchemy and Alembic migrations together with a Postgres DB in the backend. Some of my DB models use enums as datatypes, via SQLAlchemy's Enum type and (because of this) Postgres enum types as well. During development I decided to replace those enums with integers, for a few smaller reasons: adding new enum values currently requires a DB migration that alters the data type, altering enum types in Alembic is rather complicated, and enums seem to be less portable to other DB systems than ints anyway. Because of this I now have to create an Alembic migration that alters the datatype of the enum columns to int.
Sadly, I have not yet found any conversion method that would let me keep the old values throughout the migration. I've been using IntEnums, which map the enum values to integers in Python, but Postgres does not use this mapping internally, so it has no information about the intended integer value when converting the enum. From my perspective, I have to provide an explicit mapping of enum name to integer when applying the ALTER TABLE query. Postgres provides the USING clause, but I have been unable to find out how to use it to map my enum values to the integer values Python has been using all along.
So what would such a USING clause have to look like if it is to map the old (enum-based) values to the new (int-based) values, given that I have to provide the mapping manually?
An example of such an enum:
import enum

class SexEnum(enum.IntEnum):
    male = 0
    female = 1
Column (old in the DB):
sex = db.Column(db.Enum(SexEnum))
Column (new):
sex = db.Column(db.Integer)
Migration command right now:
op.alter_column('users', 'sex',
                type_=sa.Integer, postgresql_using='null')
Null probably has to be replaced with something else.
Also, it should be noted that it would be great if the migration were as compatible with other DB systems as possible, which would force me to use something other than the USING clause anyway, wouldn't it?
Okay found it. Thanks to Ilja for the tip with the CASE.
Basically replaced 'null' with
'(CASE sex WHEN \'male\'... END)'
Which brought the migration live without having to set all those values another way. I don't think this fully fulfills the portability requirement I had, simply because I'm still using the Postgres USING clause. Apart from that, the simple CASE statement works.
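For reference, a sketch of what the complete command might look like with the SexEnum mapping spelled out (the 'female' arm is my assumption, filled in from the enum definition in the question):
op.alter_column(
    'users', 'sex',
    type_=sa.Integer,
    postgresql_using="(CASE sex WHEN 'male' THEN 0 WHEN 'female' THEN 1 END)")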

Django - short non-linear non-predictable ID in the URL

I know there are similar questions, but I have specific requirements and am looking for a less expensive way to do the following (on Django 1.10.2):
Looking to not have sequential/guessable integer ids in the URLs, and ideally to meet the following requirements:
Avoid UUIDs since that makes the URL really long.
Avoid a custom primary key. It doesn’t seem to work well if the models have ManyToManyFields. I got hit by at least three bugs while trying that (#25012, #24030 and #22997), including messing up the migrations and having to delete the entire db and recreate the migrations (well, lots of good learning too)
Avoid checking for collisions if possible (hence avoid a db lookup for every insert)
Don’t want to look up by the slug, since that’s less performant than looking up an integer id.
Don’t care too much about encrypting the id - just don’t want it to be a visibly sequential integer.
Note: The app would likely have 5 million records or so in the long term.
After researching a lot of options on SO, blogs etc., I ended up doing the following:
Encoding the id to base 32 only for the URLs and decoding it back in urls.py (using an edited version of Django’s base-36 util functions, since I needed uppercase letters instead of lowercase).
Not storing the encoded id anywhere. Just encoding and decoding every time on the fly.
Keeping the default id intact and using it as primary key.
(good hints and posts along the way helped a lot)
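A minimal sketch of what such helpers could look like (hypothetical names, modeled on django.utils.http’s int_to_base36/base36_to_int):
ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUV'  # base 32, uppercase

def int_to_base32(i):
    # Repeatedly divide by 32, emitting one alphabet character per digit.
    if i < 0:
        raise ValueError('Negative base32 conversion input.')
    encoded = ''
    while True:
        i, r = divmod(i, 32)
        encoded = ALPHABET[r] + encoded
        if i == 0:
            return encoded

def base32_to_int(s):
    # int() handles bases up to 36 (digits 0-9 then letters, case-insensitive).
    return int(s, 32)
urls.py would then decode the captured string back to the integer pk before the view does its usual lookup.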
What this solution helps achieve:
Absolutely no edits to models or post_save signals.
No collision checks needed. Avoiding one extra request to the db.
Lookup still happens on the default id, which is fast. Also, no double save() requests on the model for every insert.
Short and sweet encoded ID (the number of characters goes up as the number of records increases, but still not very long)
What it doesn’t help achieve/any drawbacks:
Encryption - the ID is encoded but not encrypted, so the user may still be able to figure out the pattern to get to the id (but I don’t care about that much, as mentioned above).
A tiny overhead of encoding and decoding on each URL construction/request but perhaps that’s better than collision checks and/or multiple save() calls on the model object for insertions.
For reference, looks like there are multiple ways to generate random IDs that I discovered along the way (like Django’s get_random_string, Python’s random, Django’s UUIDField etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not).
The encoded ID can also be stored as another (indexed) field and looked up every time, depending on the performance parameters of the web app (looking up a varchar id is less performant than looking up an integer id). This identifier field can be saved either from an overridden save() method on the model or by using a post_save() signal (both approaches need save() to be called twice for every insert).
All ears for optimizations to the above approach. I love SO and the community. Every time there’s so much to learn here.
Update: More than a year after this post, I found this great library called hashids, which does pretty much the same thing quite well! It’s available in many languages, including Python.
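A quick sketch of hashids usage (the salt is a placeholder):
from hashids import Hashids

hashids = Hashids(salt='some secret salt', min_length=6)
token = hashids.encode(4815)   # short, non-sequential-looking string
(pk,) = hashids.decode(token)  # decode returns a tuple of ints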

How badly should I avoid surrogate primary keys in SQL?

Short story
I have a technical problem with a third-party library at my hands that I seem unable to solve easily in any way other than creating a surrogate key (despite the fact that I'll never need it). I've read a number of articles on the Net discouraging the use of surrogate keys, and I'm a bit at a loss as to whether it is okay to do what I intend to do.
Long story
I need to specify a primary key, because I use SQLAlchemy ORM (which requires one), and I cannot just set it in __mapper_args__, since the class is being built with classobj, and I have yet to find a way to reference the field of a not-yet-existing class in the appropriate PK definition argument. Another problem is that the natural equivalent of the PK is a composite key that is too long for the version of MySQL I use (and it's generally a bad idea to use such long primary keys anyway).
I always make surrogate keys when using ORMs (or rather, I let the ORMs make them for me). They solve a number of problems, and don't introduce any (major) problems.
So, you've done your job by acknowledging that there are "articles on the Net" with valid reasons to avoid surrogate keys, and that there's probably a better way to do it.
Now, write "# TODO: find a way to avoid surrogate keys" somewhere in your source code and go get some work done.
"Using a surrogate key allows duplicates to be created when using a natural key would have prevented such problems" Exactly, so you should have both keys, not just a surrogate. The error you seem to be making is not that you are using a surrogate, it's that you are assuming the table only needs one key. Make sure you create all the keys you need to ensure the integrity of your data.
Having said that, in this case it seems like a deficiency of the ORM software (apparently not being able to use a composite key) is the real cause of your problems. It's unfortunate that a software limitation like that should force you to create keys you don't otherwise need. Maybe you could consider using different software.
I use surrogate keys in a db that I use reflection on with SQLAlchemy. The pro is that you can more easily manage the foreign keys / relationships that exist in your tables / models. Also, the RDBMS manages the data more efficiently. The con is possible data inconsistency: duplicates. To avoid this, always put a unique constraint on your natural key.
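A minimal sketch of that pattern in SQLAlchemy (table and column names are hypothetical):
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Part(Base):
    __tablename__ = 'parts'
    id = Column(Integer, primary_key=True)  # surrogate key for the ORM
    maker = Column(String(64), nullable=False)
    part_number = Column(String(64), nullable=False)
    # The natural key keeps its integrity through a unique constraint.
    __table_args__ = (UniqueConstraint('maker', 'part_number'),)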
Now, I understand from your long story that you can't enforce this uniqueness because of your MySQL limitations: MySQL causes problems with long composite keys. I suggest you move to PostgreSQL.

Python MySQLdb: Update if exists, else insert

I am looking for a simple way to run an update or an insert depending on whether the row exists in the first place. I am trying to use Python's MySQLdb right now.
This is how I execute my query:
self.cursor.execute("""UPDATE `inventory`
SET `quantity` = `quantity`+{1}
WHERE `item_number` = {0}
""".format(item_number,quantity));
I have seen four ways to accomplish this:
DUPLICATE KEY. Unfortunately the primary key is already taken up as a unique ID so I can't use this.
REPLACE. Same as above; I believe it relies on a primary key to work properly.
mysql_affected_rows(). Usually you can use this after updating the row to see if anything was affected. I don't believe MySQLdb in Python supports this feature.
Of course the last ditch effort: Make a SELECT query, fetchall, then update or insert based on the result. Basically I am just trying to keep the queries to a minimum, so 2 queries instead of 1 is less than ideal right now.
Basically I am wondering if I missed any other way to accomplish this before going with option 4. Thanks for your time.
MySQL DOES allow you to have unique indexes, and INSERT ... ON DUPLICATE KEY UPDATE will do the update if any unique index has a duplicate, not just the PK.
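A sketch of that single-statement form, assuming a unique index on item_number:
self.cursor.execute("""INSERT INTO `inventory` (`item_number`, `quantity`)
                       VALUES (%s, %s)
                       ON DUPLICATE KEY UPDATE
                           `quantity` = `quantity` + VALUES(`quantity`)""",
                    (item_number, quantity))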
However, I'd probably still go for the "two queries" approach. You are doing this in a transaction, right?
Do the update
Check the rows affected, if it's 0 then do the insert
OR
Attempt the insert
If it failed because of a unique index violation, do the update (NB: You'll want to check the error code to make sure it didn't fail for some OTHER reason)
The former is good if the row will usually exist already, but can cause a race condition (or deadlock) if you do it outside a transaction or if your isolation level is not high enough.
Creating a unique index on item_number in your inventory table sounds like a good idea to me, because I imagine (without knowing the details of your schema) that one item should only have a single stock level (assuming your system doesn't allow multiple stock locations etc).
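For completeness, a sketch of the second ("attempt the insert") approach with MySQLdb, assuming the unique index on item_number exists:
import MySQLdb

def upsert(cursor, item_number, quantity):
    try:
        cursor.execute(
            "INSERT INTO `inventory` (`item_number`, `quantity`) VALUES (%s, %s)",
            (item_number, quantity))
    except MySQLdb.IntegrityError as e:
        if e.args[0] != 1062:  # 1062 = ER_DUP_ENTRY; re-raise anything else
            raise
        cursor.execute(
            "UPDATE `inventory` SET `quantity` = `quantity` + %s"
            " WHERE `item_number` = %s",
            (quantity, item_number))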
