Table metadata transfer from SQL Server to PostgreSQL using SQLAlchemy - Python

I am trying to migrate my database from MS SQL Server to PostgreSQL using a Python script.
Before migrating the data, the script needs to create the required tables.
I intend to use SQLAlchemy to create the required tables and then migrate the actual data. Below is the sample code. While creating a table in PostgreSQL, the script fails because PostgreSQL has no data type like TINYINT. I thought SQLAlchemy abstracted these data types.
Any suggestions and best practices for this kind of use case would be of great help.
from sqlalchemy import create_engine, MetaData, select, func, Table
import pandas as pd
engine_pg = create_engine('postgresql://XXXX:YYYY$@10.10.1.4:5432/pgschema')
engine_ms = create_engine('mssql+pyodbc://XX:YY@10.10.1.5/msqlschema?driver=SQL+Server')
ms_metadata = MetaData(bind=engine_ms)
pg_metadata = MetaData(bind=engine_pg)
# reflect the Node table object from MSSQL using ms_metadata and engine_ms
Node = Table('Node', ms_metadata, autoload_with=engine_ms)
# create the Node table in PostgreSQL using the reflected table object
Node.create(bind=engine_pg)

While I have not done the MS SQL to PostgreSQL path, I have done some other (small to tiny) migrations and have some minor experience with both of the databases you are looking at. The solution to your specific problem is probably best handled through a mapping of data types.
There is a library that I have looked at but never gotten around to using which contains such mappings:
https://pgloader.readthedocs.io/en/latest/ref/mssql.html?highlight=tinyint%20#default-ms-sql-casting-rules
Since a data migration is usually done just once, I would recommend making use of an existing tool. SQLAlchemy is not really such a tool from my understanding, but it could potentially be turned into one with some effort.
Regarding your question about SQLAlchemy abstracting the data types, I would not hold this situation against SQLAlchemy. TINYINT is a 1-byte data type. There is no such data type available in PostgreSQL, which makes a direct mapping impossible; hence the mapping found in pgloader (linked above).
https://learn.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15
https://www.postgresql.org/docs/9.1/datatype-numeric.html
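If you want to keep using SQLAlchemy for the table-creation step, one workaround is to rewrite the reflected column types before the DDL is emitted on the PostgreSQL side. A minimal sketch, reusing the table name and placeholder engine URLs from your code:
from sqlalchemy import MetaData, SmallInteger, Table, create_engine
from sqlalchemy.dialects.mssql import TINYINT

engine_ms = create_engine('mssql+pyodbc://XX:YY@10.10.1.5/msqlschema?driver=SQL+Server')
engine_pg = create_engine('postgresql://XXXX:YYYY@10.10.1.4:5432/pgschema')

ms_metadata = MetaData()
node = Table('Node', ms_metadata, autoload_with=engine_ms)

# PostgreSQL has no 1-byte integer, so rewrite TINYINT columns to SMALLINT
# before creating the table on the PostgreSQL side.
for column in node.columns:
    if isinstance(column.type, TINYINT):
        column.type = SmallInteger()

node.create(bind=engine_pg)
You would need a similar substitution for every MSSQL-specific type in your schema, which is exactly the mapping table pgloader already ships with.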
Finally, some thoughts on the meta information available here. It seems like you are offering a bounty on this 6 months after you posted the original question, which is interesting: it is either a huge project or one you don't allocate a lot of time to. Either way, I urge you to use an existing tool rather than trying to make something work beyond its intended usage. Another thing is the inclusion of the pandas import. If you are thinking of using pandas for the data transfer, I want to caution you that pandas is very forgiving with data formats. This might not be a problem for you, but a more controlled data pipeline would probably be less error prone.
Given the previous paragraph, I'd like some more info on the overall situation before pointing you in the right direction. Database migration can have other unforeseen consequences as well, so I don't want to give the impression that the solution to your overall problem is a quick fix as simple as a TINYINT-to-SMALLINT mapping.


Query identical tables across all schemas on Postgres

I'm using Postgres and I have multiple schemas with identical tables that are dynamically added by the application code:
foo, bar, baz, abc, xyz, ...
I want to be able to query all the schemas as if they were a single table.
!!! I don't want to query all the schemas one by one and combine the results.
I want to "combine" (not sure if this would be considered a huge join) the tables across schemas and then run the query.
For example, an ORDER BY query shouldn't return results like
1. schema_A.result_1
2. schema_A.result_3
3. schema_B.result_2
4. schema_B.result_4
but instead it should be
1. schema_A.result_1
2. schema_B.result_2
3. schema_A.result_3
4. schema_B.result_4
If possible I don't want to generate a query that goes like
SELECT schema_A.table_X.field_1, schema_B.table_X.field_1 FROM schema_A.table_X, schema_B.table_X
But I want that to be taken care of in PostgreSQL, in the database.
Generating a query with all the schemas (namespaces) appended can make my queries HUGE, with ~50 fields and ~50 schemas.
Since these tables are generated I also cannot inherit them from some global table and query that instead.
I'd also like to know if this is simply not possible at a reasonable speed.
EXTRA:
I'm using Django and django-tenants, so I'd also accept any answer that actually helps me generate the entire query and run it to get a global queryset, EVEN THOUGH it would be really slow.
Your question isn't as much a question as it is an admission that you've got a really terrible database and application design. It's as if you partitioned something that didn't need to be partitioned, or partitioned it in the wrong way.
Since you're doing something awkward, the database itself won't provide you with any elegant solution. Instead, you'll have to get more and more awkward until the regret becomes too much to bear and you redesign your database and/or your application.
I urge you to repent now; the sooner the better.
After that giant caveat based on a haughty moral position, I acknowledge that the only reason we answer questions here is to get imaginary internet points. And so, my answer is this: use a view that unions all of the values together and presents them as if they came from one table. I can't make much sense of the "order by query" part, so I'll ignore it for now. Maybe you mean that you want the results in a certain order; if so, you can add a constant to each SELECT operand of each UNION ALL and ORDER BY that constant column coming out of the union. But if the order of the rows matters, I'd assert that you are showing yet another symptom of a poor database design.
You can programmatically update the view whenever you create or update schemas and their catalogs.
A working example is here: http://sqlfiddle.com/#!17/c09265/1
with this schema creation and population code:
CREATE SCHEMA Fooey;
CREATE SCHEMA Junk;
CREATE TABLE Fooey.Baz (SomeInteger INT);
CREATE TABLE Junk.Baz (SomeInteger INT);
INSERT INTO Fooey.Baz (SomeInteger) VALUES (17), (34), (51);
INSERT INTO Junk.Baz (SomeInteger) VALUES (13), (26), (39);
CREATE VIEW AllOfThem AS
SELECT 'FromFooey' AS SourceSchema, SomeInteger FROM Fooey.Baz
UNION ALL
SELECT 'FromJunk' AS SourceSchema, SomeInteger FROM Junk.Baz;
and this query:
SELECT *
FROM AllOfThem
ORDER BY SourceSchema;
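Since the schema list changes as tenants are added, the view can be rebuilt programmatically, as mentioned above. A rough sketch with psycopg2 (the connection string, table name and view name are illustrative, not taken from your application):
import psycopg2

TABLE = 'baz'        # the per-schema table to union
VIEW = 'allofthem'   # the cross-schema view to (re)create

conn = psycopg2.connect('dbname=mydb user=me')
with conn, conn.cursor() as cur:
    # Find every schema that contains the table.
    cur.execute(
        "SELECT table_schema FROM information_schema.tables "
        "WHERE table_name = %s",
        (TABLE,),
    )
    schemas = [row[0] for row in cur.fetchall()]

    # Schema names come from the catalog, not from user input, so plain
    # string formatting is acceptable for this one-off DDL statement.
    selects = [
        f"SELECT '{s}' AS source_schema, * FROM {s}.{TABLE}" for s in schemas
    ]
    cur.execute(
        f"CREATE OR REPLACE VIEW {VIEW} AS {' UNION ALL '.join(selects)}"
    )
Run this from whatever hook creates a new tenant schema, and the view stays in sync.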
Why are per-tenant schemas a bad design?
This design favors laziness over scalability. If you don't want to make changes to your application, you can simply slam connections to a particular schema and keep working without any code changes. Adding more tenants means adding more schemas, which it sounds like you've automated. Adding many schemas will eventually make database management cumbersome (what if you have thousands or millions of tenants?), and even if you have only a few, the dynamic nature of the schema list and the difficulty of writing system-wide queries are issues you've already discovered.
Consider instead combining everything and adding the tenant ID as part of a key on each table. In that case, adding more tenants means adding more rows. Any summary queries trivially come from single tables, and all of the features and power of the database implementation and its query language are at your fingertips without any fuss whatsoever.
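As a sketch of that shape (illustrative names only, declared with SQLAlchemy since that is already in your stack):
from sqlalchemy import Column, Integer, String, MetaData, Table

metadata = MetaData()

# One shared table for every tenant; the tenant id is part of the key.
orders = Table(
    'orders', metadata,
    Column('tenant_id', Integer, primary_key=True),
    Column('order_id', Integer, primary_key=True),  # composite primary key
    Column('status', String(20)),
)

# A cross-tenant query is then an ordinary single-table query:
#   SELECT * FROM orders ORDER BY status;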
It's simply false that a database design can't be changed, even in an existing and busy system. It takes a lot of effort to do it, but it can be done and people do it all the time. That's why getting the database design right as early as possible is important.
The README of the django-tenants package you're using describes their decision to trade off towards laziness, and cites a whitepaper that outlines many of the shortcomings and alternatives of that method.

Database data integrity check using Python SQLAlchemy or SQL query

I have imported, and will keep importing, data from different sources into a SQL Server database. There are logic checks, such as: the sum of one column should be within 5 dollars of an amount in another table. It is hard for me to check this logic while importing because some data will be inserted manually, some imported using Excel VBA or Python. Data accuracy is very important to me, so I am thinking of checking after the data is inserted. I see two choices:
Python with SQLAlchemy, writing the logic in Python
Create a stored procedure or direct SQL to verify
What are the advantages and disadvantages of SQLAlchemy vs. a stored procedure for data checks? Or are there other solutions?
The benefits in my mind of SQLAlchemy with automap:
Possible combined use with Jupyter for a nice user interface
Logic is easier to write, such as looping over one table where each row should equal the sum of another table with some WHERE conditions.
Benefits of a SQL stored procedure:
Can all be managed in SQL server management studio
The answer is neither. Most RDBMSs have built-in mechanisms to enforce restrictions on the data that is inserted into a row or a column. As you said:
It is hard for me to check this logic when importing because some data
will be manually inserted, some imported using Excel VBA or python.
Data Accuracy is very important to me
You can't have your code in all these places. What works is constraints.
CHECK constraints enforce domain integrity by limiting the values that
are accepted by one or more columns. You can create a CHECK constraint
with any logical (Boolean) expression that returns TRUE or FALSE based
on the logical operators
You can of course use stored procedures for this, but constraints are more efficient, transparent, and easier to maintain.
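As a concrete illustration (hypothetical table and column names), a single-table rule like the quoted one can be declared directly in the table metadata with SQLAlchemy. Cross-table rules such as your five-dollar comparison would typically need a trigger or a verification query instead, since a plain CHECK constraint only sees the row being written:
from sqlalchemy import CheckConstraint, Column, Integer, MetaData, Numeric, Table

metadata = MetaData()

payments = Table(
    'payments', metadata,
    Column('id', Integer, primary_key=True),
    Column('amount', Numeric(10, 2), nullable=False),
    # Rejected at insert time no matter whether the row arrives via
    # Python, Excel VBA, or a manual INSERT.
    CheckConstraint('amount >= 0', name='ck_payments_amount_nonnegative'),
)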

Storing the results of a SQLAlchemy query to merge into another session

I have a SQLAlchemy-based tool for selectively copying data between two different databases for testing purposes. I use the merge() function to take model objects from one session and store them in another session. I'd like to be able to store the source objects in some intermediate form and then merge() them at some later point in time.
It seems like there are a few options to accomplish this:
Exporting DELETE/INSERT SQL statements. Seems pretty straightforward, I think I can get SQLAlchemy to give me the INSERT statements, and maybe even the DELETEs.
Exporting the data to a SQLite database file with the same (or similar) schema, which could then be read in as a source at a later point in time.
Serializing the data in some manner and then reading them back into memory for the merge. I don't know if SQLAlchemy has something like this built-in or not. I'm not sure what the challenges would be in rolling this myself.
Has anyone tackled this problem before? If so, what was your solution?
EDIT: I found a tool built on top of SQLAlchemy called dataset that provides the freeze functionality I'm looking for, but there seems to be no corresponding thaw functionality for restoring the data.
I haven't used it before, but the dogpile caching techniques described in the documentation might be what you want. This allows you to query to and from a cache using the SQLAlchemy API:
http://docs.sqlalchemy.org/en/rel_0_9/orm/examples.html#module-examples.dogpile_caching
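If caching is more machinery than you need, a hand-rolled version of option 3 from your question could look roughly like this (a sketch; freeze/thaw are hypothetical helpers, not SQLAlchemy API):
import pickle
from sqlalchemy import inspect

def freeze(objects, path):
    # Dump the column values of the given ORM objects to a file.
    rows = [
        {attr.key: getattr(obj, attr.key)
         for attr in inspect(obj).mapper.column_attrs}
        for obj in objects
    ]
    with open(path, 'wb') as f:
        pickle.dump(rows, f)

def thaw(model, path, session):
    # Rebuild detached instances from the file and merge() them in.
    with open(path, 'rb') as f:
        rows = pickle.load(f)
    for row in rows:
        session.merge(model(**row))
This assumes the model's constructor accepts its column names as keyword arguments, which is the default for declarative models, and it deliberately ignores relationships.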

How to store a SQL database in a Python object and perform queries on the object?

I have a big PostgreSQL database. I would like to somehow store the full database in a Python object whose form/structure would reflect that of the database. Namely, I imagine something like
* A "database" object with an attribute .tables, which is a kind of list of "table" objects; a table object has an attribute "list_of_keys" (a list of the column names) and an attribute "rows", which reflects all the rows of the corresponding table in the database.
Now, the main point is: I want to be able to perform a search in the database object with exactly the same SQL syntax that I would use in the corresponding SQL database. Thus something like
database.execute("SELECT * FROM .....")
where, I repeat, "database" is a purely Python object (which was filled with data coming from a SQL database, but which is now independent of it).
My aim is: I want to be able to apply the same algorithm either to a SQL database or to a Python object as described above, without changing my code. So, I imagine, let "database" be either a usual database connector/cursor (like with psycopg, for example) or a Python object as described, and the same piece of code
database.execute("SELECT BLABLABLA")
would work in both cases.
Is there any known module which allows that?
Thanks.
It might get a bit complicated, but take a look at SQLite's in-memory storage:
import sqlite3
cnx = sqlite3.connect(':memory:')
cnx.execute('CREATE TABLE ...')
There are some differences in the SQL syntax, but the basic stuff works fine. This might also take a good amount of RAM, depending on your data.
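For completeness, a rough sketch of filling the in-memory copy from PostgreSQL with psycopg2 (the connection string, table and column names are placeholders):
import sqlite3
import psycopg2

pg = psycopg2.connect('dbname=mydb user=me')
mem = sqlite3.connect(':memory:')

# Copy one table's rows out of PostgreSQL.
with pg, pg.cursor() as cur:
    cur.execute('SELECT id, name FROM mytable')
    rows = cur.fetchall()

# Recreate the table in memory and load the rows.
mem.execute('CREATE TABLE mytable (id INTEGER, name TEXT)')
mem.executemany('INSERT INTO mytable (id, name) VALUES (?, ?)', rows)

# The same SELECT now runs against the in-memory copy.
print(mem.execute('SELECT * FROM mytable ORDER BY id').fetchall())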

Will two programs using SQLAlchemy conflict when trying to access the same table (SQLite)?

I'm going to have two independent programs (using SQLAlchemy / ORM / Declarative)
that will inevitably try to access the same database file/table (SQLite) at the same time.
They could both want to read or write to that table.
Will there be a conflict when this happens?
If the answer is yes, how could this be handled?
SQLite is resilient to the kind of issues you describe. http://www.sqlite.org/howtocorrupt.html gives details on what can cause corruption, and those causes are generally isolated from anything your code might accidentally do.
If you're concerned because of the nature of your application's data access, use BEGIN TRANSACTION and COMMIT/ROLLBACK as appropriate. If your transactions are single-query accesses (that is, you're not reading a value in one query and then changing it in another query relative to what you already read), this should not be necessary.
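If you do end up wanting explicit transactions, a minimal sketch with SQLAlchemy Core (the table and column names are made up):
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///shared.db')

# engine.begin() emits BEGIN and COMMITs on success (ROLLBACK on error),
# so the read and the dependent write happen in one transaction.
with engine.begin() as conn:
    balance = conn.execute(
        text('SELECT balance FROM accounts WHERE id = :id'), {'id': 1}
    ).scalar_one()
    conn.execute(
        text('UPDATE accounts SET balance = :b WHERE id = :id'),
        {'b': balance + 10, 'id': 1},
    )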
