Advanced migration between databases using Python, SQLAlchemy, and Alembic

--- UPDATED to be more clear ---
I have quite the task ahead of me, and I am hoping Alembic and SQLAlchemy can do what I need.
I want to develop a desktop application that migrates data from one SQLite database to a completely different model and back again. So, migrations, right? That's all well and good, but I also need to ensure that any column or table I haven't explicitly modeled is still carried over into a table that can be read later to reconstruct the original database.
Example:

DB1:
  Table names: ID, First Name, Last Name
  Table address: Street 1, Street 2, City, State

DB2:
  Table givenName: ID, name
  Table surname: ID, name
Say with Alembic I have mapped the following:
DB1 names.firstname => DB2 givenName.name
DB1 names.lastname => DB2 surname.name
But say I want to use a migration to port DB1 to DB2, store the unmapped data somewhere, and then reconstruct it properly when I go from DB2 back to DB1.
The way I'd envision this is a joiner table of sorts:
DB2:
  Table Joiner: table_name, column_name, data
The thing is I want this all to be completely dynamic so no piece of information is ever lost.
Now let's add an extra layer of complexity: I want to build a generator of sorts, so I can simply pass in new XML/JSON declarations. These would define the mappings, plus any calls to translators already built into the code (date conversions, etc.).
For an example of two database formats that need to be migrated from one to the other see https://cloudup.com/cYzP2lCQjbo
My question is whether this is even possible or conceivable with SQLAlchemy and Alembic. How feasible is it? Any thoughts?
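To make the joiner-table idea concrete, here is a minimal sketch using plain SQLAlchemy reflection (SQLAlchemy 1.4+ assumed); the unmapped_data table, the MAPPED set, and the src_row_id convention are all invented for illustration, and Alembic would only be the vehicle for running something like this inside a migration:

# Reflect the source schema and, for every column not covered by an explicit
# mapping, stash (table, column, row id, value) in an overflow/"joiner" table
# so it can be replayed later. All names here are illustrative.
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, String, Text, select, insert
)

src_engine = create_engine("sqlite:///db1.sqlite")
dst_engine = create_engine("sqlite:///db2.sqlite")

src_meta = MetaData()
src_meta.reflect(bind=src_engine)          # discover every table/column in DB1

dst_meta = MetaData()
unmapped = Table(                          # the "joiner"/overflow table in DB2
    "unmapped_data", dst_meta,
    Column("id", Integer, primary_key=True),
    Column("src_table", String(128)),
    Column("src_column", String(128)),
    Column("src_row_id", Integer),
    Column("value", Text),
)
dst_meta.create_all(dst_engine)

MAPPED = {("names", "firstname"), ("names", "lastname")}   # columns the real migration handles

with src_engine.connect() as src, dst_engine.begin() as dst:
    for table in src_meta.sorted_tables:
        for row in src.execute(select(table)).mappings():
            for col in table.columns:
                if (table.name, col.name) in MAPPED or col.primary_key:
                    continue               # already covered by the explicit mapping
                dst.execute(insert(unmapped).values(
                    src_table=table.name,
                    src_column=col.name,
                    src_row_id=row.get("ID"),
                    value=None if row[col.name] is None else str(row[col.name]),
                ))

Going back from DB2 to DB1 would then be the reverse pass: recreate the original tables and replay the unmapped_data rows into them alongside the explicitly mapped columns.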

Related

How can I have a database with thousands of tables with varying number of columns that are all of the same class in Django / SQLAlchemy ORM?

I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have decade-long data. Each company's financial statement has its own table, structured as follows (the header row gives the columns):
lineitem   2019   2018   2017
2          1000    800    600
3206        700    300   -200
56           50    100    100
200        1200     90    700
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me output in the correct shape for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name.
I am able to get the data into the database and make queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and are not very readable. I like the potential of using an ORM in Django or SQLAlchemy. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns to create, but I don't know that in advance, since I have a script that parses a CSV data dump containing the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update each table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try much in practice because of this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but have not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy, given the structure I plan to have it fit into? And how can I have the selected table(s) (based on company ID or company name) be objects in the ORM just like any other object, allowing me to select the data I want at the granularity I want?
Ideally there is a solution to this in Django, but since I haven't found anything, I suspect there is none, or that the way I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application, because, for example, at a certain point in the application's lifetime new attributes become required for the data. You do not alter the schema because there is more data; more data simply translates to new rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form (3NF).
You really have to understand this. (I haven't read it myself, just skimmed it, but I assume it is correct.) This is fundamental database knowledge you cannot escape. Once you learn it properly and practice it, it becomes second nature, and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
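To make that concrete, a normalized layout for this kind of data usually ends up as one fact row per (company, line item, year) instead of one table per company. A minimal sketch in Django, where the model and field names are placeholders rather than anything from the question:

# One fact row per (company, line item, year); importing another year of data
# means inserting rows, never altering tables.
from django.db import models

class Company(models.Model):
    name = models.CharField(max_length=255)

class LineItem(models.Model):                    # the ~10,000-record mapping table
    code = models.IntegerField(unique=True)      # e.g. 3206
    description = models.CharField(max_length=255)   # e.g. "Debt to credit institutions"

class FinancialFact(models.Model):
    company = models.ForeignKey(Company, on_delete=models.CASCADE)
    line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
    year = models.PositiveSmallIntegerField()
    amount = models.DecimalField(max_digits=18, decimal_places=2)

    class Meta:
        unique_together = [("company", "line_item", "year")]

# A new year of data is just more rows:
#   FinancialFact.objects.create(company=c, line_item=li, year=2020, amount=1234)
# and one company's statement "table" is a single query:
#   FinancialFact.objects.filter(company=c).order_by("line_item__code", "-year")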

Does Django provide any built-in way to update PostgreSQL autoincrement counters?

I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
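For reference, the copy step described above can be sketched in a few lines with Django's serializers; the "old"/"new" database aliases and MyModel are placeholders for whatever is in your settings and app:

# Rough shape of the serialize/deserialize copy, assuming both databases are
# configured in DATABASES as "old" and "new"; names are illustrative.
from django.core import serializers
from myapp.models import MyModel   # hypothetical model

# Dump from the MySQL instance...
payload = serializers.serialize("json", MyModel.objects.using("old").all())

# ...and load into the PostgreSQL instance, keeping the original primary keys.
for deserialized in serializers.deserialize("json", payload):
    deserialized.save(using="new")
# Because the ids are written explicitly, the "new" database's sequence never
# advances, which is exactly what triggers the IntegrityError described above.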
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
with connection.cursor() as cur:
    # Advance the serial sequence to max(id) so new inserts don't collide.
    cur.execute(
        "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
    )
About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increments (sequences) don't advance when you insert rows
    # with an explicitly specified ID, so set the sequence to the current max.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
Which I did using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
One option is to use natural keys during serialization and deserialization. That way, when you insert the data into PostgreSQL, it will auto-increment the primary key field and keep everything in line.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
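If you go that route, the usual pieces are a natural_key() method on the model, a manager with get_by_natural_key(), and the --natural-* flags when dumping; the UserProfile model below is just an illustration:

# Sketch of the natural-key approach; UserProfile and its fields are illustrative.
from django.db import models

class UserProfileManager(models.Manager):
    def get_by_natural_key(self, email):
        return self.get(email=email)

class UserProfile(models.Model):
    email = models.EmailField(unique=True)       # the unique non-id field(s)
    display_name = models.CharField(max_length=100)

    objects = UserProfileManager()

    def natural_key(self):
        return (self.email,)

# Then dump without hard-coded primary keys and load into PostgreSQL, letting
# the serial column assign fresh ids:
#   python manage.py dumpdata myapp --natural-foreign --natural-primary > data.json
#   python manage.py loaddata data.json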

django postgresql query not working

I have a PostgreSQL database that has a table foo that I've created outside of Django. I used manage.py inspectdb to build the model for table foo for me. This technique worked fine when I was using MySQL, but with PostgreSQL it is failing MISERABLY. The table is multiple gigabytes and I build it from a text file with PostgreSQL 'COPY'.
I can run raw queries on table foo and everything executes and expected.
For example
foo.objects.raw('bar_sql')
executes as expected.
But running queries like:
foo.objects.get(bar=bar)
throw
ProgrammingError column foo.id does not exist LINE 1: SELECT "foo"."id", "foo"."bar1", "all_...
foo doesn't innately have an id field. As I understand it, Django is supposed to create one. Have I somehow subverted this step by creating the tables outside of Django?
Queries run on models whose tables were populated through Django run as expected in all cases.
I'm missing something very basic here and any help would be appreciated.
I'm using Django 1.6 with PostgreSQL 9.3.
Django doesn't modify your existing database tables. It only creates new tables; if you have existing tables, it usually doesn't touch them at all.
"As I understand it, Django is supposed to create one." --> It only adds a primary key to a table when it creates it, which means you don't need to specify one explicitly in your model, but it won't do anything to an existing table.
So if for example you later on decide to add fields to your models, you have to update your databases manually.
What you need to do in your case is to make sure, through manual database administration, that your table has a primary key, and also that the primary key column is named "id" (although I am not sure the name is strictly necessary, it is better to do it). So use a database administration tool, modify your table, add the primary key, and name it id. Then it should start working.
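Concretely, that could look something like the sketch below, assuming the table really is named foo; the column name bar is hypothetical. Either add a surrogate id column in PostgreSQL, or mark an existing unique column as the primary key in the inspectdb-generated model:

# Option 1: add a surrogate id column in PostgreSQL (run once, outside Django):
#   ALTER TABLE foo ADD COLUMN id SERIAL PRIMARY KEY;
#
# Option 2: if foo already has a unique column, tell Django to use it as the
# primary key instead of a phantom "id". The column name "bar" is hypothetical.
from django.db import models

class Foo(models.Model):
    bar = models.TextField(primary_key=True)   # an existing unique column

    class Meta:
        managed = False        # Django will not create or alter this table
        db_table = 'foo'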

Flask-SQLAlchemy - When are the tables/databases created and destroyed?

I am a little confused with the topic alluded to in the title.
So, when a Flask app is started, does SQLAlchemy look at the SQLALCHEMY_DATABASE_URI for the correct (in my case, MySQL) database? And does it then create the tables if they do not already exist?
What if the database configured in the SQLALCHEMY_DATABASE_URI variable in the config.py file does not exist?
What if that database exists, and only a few of the tables exist (There are more tables coded into the SQLAlchemy code than exist in the actual MySQL database)? Does it erase those tables and then create new tables with the current specs?
And what if those tables do all exist? Do they get erased and re-created?
I am trying to understand how the entire process works so that I (1) don't lose database information when changes are made to the schema, and (2) can write the necessary code to completely manage how and when SQLAlchemy talks to the actual database.
Tables are not created automatically; you need to call the SQLAlchemy.create_all() method explicitly to have it create the tables for you:
db = SQLAlchemy(app)
db.create_all()
You can do this from a command-line utility, for example, or, if you deploy to a PaaS such as Google App Engine, from a dedicated admin-only view.
The same applies for database table destruction; use the SQLAlchemy.drop_all() method.
See the Creating and Dropping tables chapter of the documentation, or take a look at the database chapter of the Mega Flask Tutorial.
You can also delegate this task to Flask-Migrate or similar schema versioning tools. These help you record and edit schema creation and migration steps; the database schema of real-life projects is never static, and you will want to be able to move existing data between versions of the schema. Creating the initial schema is then just the first step.
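Putting the pieces together, a minimal setup looks roughly like the sketch below; the URI and the User model are placeholders, and in recent Flask-SQLAlchemy releases create_all() needs to run inside an application context:

# Minimal sketch; the URI, model, and app layout are illustrative only.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "mysql://user:pass@localhost/mydb"
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))

with app.app_context():
    db.create_all()   # creates missing tables only; existing tables are left untouched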

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL databases where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc.
* Table 2: (email) user_id
* Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data: what is the best way to handle deletions from Table 1 so that the matching data gets deleted from Tables 2 and 3? And how do I ensure inserts stay consistent?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
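For what it's worth, the check-then-roll-back idea from steps 4 and 5 might look roughly like this with boto3; the table and attribute names are made up, and this only narrows the inconsistency window rather than closing it:

# Sketch of insert-then-verify across the three tables with boto3.
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")
by_email = dynamodb.Table("users_by_email")
by_employee_id = dynamodb.Table("users_by_employee_id")

def create_user(user):
    writes = [
        (users, dict(user)),                                              # full record
        (by_email, {"email": user["email"], "user_id": user["user_id"]}),
        (by_employee_id, {"employee_id": user["employee_id"], "user_id": user["user_id"]}),
    ]
    written = []
    try:
        for table, item in writes:
            table.put_item(Item=item)
            written.append((table, item))
        # Step 4: verify all three rows are readable.
        for table, item in writes:
            key_name = table.key_schema[0]["AttributeName"]
            if "Item" not in table.get_item(Key={key_name: item[key_name]}):
                raise RuntimeError("missing row in " + table.name)
    except Exception:
        # Step 5: best-effort rollback, then surface the error.
        for table, item in written:
            key_name = table.key_schema[0]["AttributeName"]
            table.delete_item(Key={key_name: item[key_name]})
        raise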
Short answer: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
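A tiny read-repair sketch of option A, again with made-up boto3 table names: if the email index points at a user who no longer exists in the main table, delete the stale index row and report a miss.

# Read-repair sketch for option A; table and attribute names are illustrative.
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")
by_email = dynamodb.Table("users_by_email")

def get_user_by_email(email):
    idx = by_email.get_item(Key={"email": email}).get("Item")
    if idx is None:
        return None                                    # no index entry at all
    user = users.get_item(Key={"user_id": idx["user_id"]}).get("Item")
    if user is None:
        # Stale index row left behind by a crashed delete: clean it up lazily.
        by_email.delete_item(Key={"email": email})
        return None
    return user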
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
