Moving data from MySQL to SAP HANA with Python

I'm trying to migrate data from a MySQL DB to SAP HANA using Python. At work we currently do this migration manually, but the plan is to run a script every day that collects the previous day's data (stored in MySQL) and moves it to HANA so we can use its analytics tools. I have written a script with two functions: one connects to MySQL and temporarily stores the query result in a Pandas DataFrame; the second uses the sqlalchemy-hana connector to create an engine that I feed into Pandas' to_sql function to store the data in HANA.
Below is the first function call to MySQL
import pandas
import mysql.connector as myscon
from mysql.connector import errorcode

def connect_to_mysql(query):
    df = None
    stagedb = None
    try:
        # connect to the staging DB and read the query result into a DataFrame
        stagedb = myscon.connect(
            user='user-name',
            password='password',
            host='awshost.com',
            database='sampletable',
            raise_on_warnings=True)
        df = pandas.read_sql(query, stagedb)
    except myscon.Error as err:
        if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
            print('Incorrect user name or password')
        elif err.errno == errorcode.ER_BAD_DB_ERROR:
            print('Database does not exist')
        else:
            print(err)
    finally:
        if stagedb:
            stagedb.close()
    return df
This is the second function call to connect to HANA
from sqlalchemy import create_engine

def connect_to_hana(query):
    # connect to the HANA DB
    try:
        engine = create_engine('hana://username:password@host:port')
        # get the DataFrame returned by the first function
        to_df = connect_to_mysql(query)
        to_df.to_sql('sample_data', engine, if_exists='append', index=False, chunksize=20000)
    except:
        raise
My HANA DB has several schemas in the catalog folder, many of them "SYS" or "_SYS" related. I have created a separate schema, with the same name as my username, to test my code and play around in.
My questions are as follows: 1) Is there a more efficient way to load data from MySQL to HANA without a go-between like a CSV file or, in my case, a Pandas DataFrame? Using VS Code it takes around 90 seconds for the script to complete. 2) When using the sqlalchemy-hana connector, how does it know which schema to create the table in and store/append the data to? The read-me file didn't really explain. Luckily it's storing it in the right schema (the one with my username), but I created another one as a test and of course the table didn't show up under that one. If I try to specify the database in the create_engine line like so:
engine = create_engine('hana://username:password@host:port/Username')
I get this error: TypeError: connect() got an unexpected keyword argument 'database'.
Also, I noticed that if I run my script twice and count the number of rows in the created table, the rows get added twice - essentially creating duplicates. Because of this, 3) would it be better to iterate through the rows of the DataFrame and insert them one by one using the pyhdb package?
Any advice/suggestions/answers will be very much appreciated! Thank you!

Gee... that seems like a rather complicated workflow. Alternatively, you may want to check out the HANA features Smart Data Access (SDA) and Smart Data Integration (SDI). With these, you can either set up "virtual" data access in SAP HANA - that is, data is read from the MySQL DB into the HANA process when you run your analytics query - or actually load the data into HANA, making it a data mart.
If it is really just about the "piping" for this data transfer, I probably wouldn't put 3rd party tools into the scenario. This only makes the setup more complicated than necessary.
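That said, if you stay with the Python route, the DataFrame go-between isn't strictly necessary: you can stream rows from the MySQL cursor straight into HANA in batches. Below is a minimal sketch, assuming the pyhdb and mysql-connector packages; the target table MY_SCHEMA.SAMPLE_DATA, the column list and all connection details are placeholders, and pyhdb is assumed to accept qmark (?) parameter markers.
import pyhdb
import mysql.connector

def copy_mysql_to_hana(select_sql, insert_sql, batch_size=10000):
    # source: MySQL
    src = mysql.connector.connect(user='user-name', password='password',
                                  host='awshost.com', database='sampletable')
    # target: HANA via pyhdb (talks to the HANA SQL port directly)
    tgt = pyhdb.connect(host='hanahost', port=30015,
                        user='username', password='password')
    src_cur = src.cursor()
    tgt_cur = tgt.cursor()
    try:
        src_cur.execute(select_sql)
        while True:
            rows = src_cur.fetchmany(batch_size)
            if not rows:
                break
            # one round trip per batch instead of one insert per row
            tgt_cur.executemany(insert_sql, rows)
        tgt.commit()
    finally:
        src.close()
        tgt.close()

# usage - table and columns are made up for illustration
copy_mysql_to_hana(
    "SELECT id, col_a, col_b FROM daily_data",
    "INSERT INTO MY_SCHEMA.SAMPLE_DATA (ID, COL_A, COL_B) VALUES (?, ?, ?)")
Fully qualifying the table as MY_SCHEMA.SAMPLE_DATA also touches on question 2: without a schema prefix, HANA uses the current schema, which by default is the schema named after the connecting user - which is why your table landed in the schema matching your username.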

Related

Passing a JSON Extract from a rest api to a SQL Server in python

I'm very new to Python programming; please excuse me if it's a silly question.
I need to perform two actions:
Extract a JSON file from a REST API (which will have a very large number of key/value pairs)
Pass the extract in tabular format to a SQL Server
I have written a sample function with only 3 parameters, which are passed to a SQL Server on my system.
How will this function change if there is an unknown number of parameters, as in the case of a JSON extract?
import pyodbc

def InsertJsonsql(Name, City, Age):
    connection = pyodbc.connect('Driver={SQL Server};'
                                'Server=IN2367403W1\SQLEXPRESS;'  # use server name
                                'Database=TestDB;'
                                'Trusted_Connection=yes;')
    cursor = connection.cursor()
    json_insert_query = """
    INSERT INTO TestDB.dbo.Person (Name, City, Age) VALUES ('{}', '{}', '{}')
    """.format(Name, City, Age)
    cursor.execute(json_insert_query)
    connection.commit()
    print("Record inserted successfully into Person table")
As mentioned by @shimo in the comments, a relational database like MySQL or MS SQL Server has a fixed schema once the table is created, so you can't add new columns to handle an unknown number of parameters.
As far as I know you have 2 options:
1.) Use a NoSQL database like MongoDB. There the data is saved as JSON and therefore it can handle any number of key-value pairs. This is the safest and best method.
2.) If you have to use a relational database you may have to create a column in your table that can handle a long String and save your data as JSON String in this column. This is not a good solution and you should rather use a NoSQL database.
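For option 2, here is a rough sketch of what that could look like with pyodbc; the table dbo.RawJson and its single payload column are made up for illustration, and the connection string is copied from the question:
import json
import pyodbc

def insert_json_blob(payload):
    # payload can hold any number of key/value pairs; it is stored as one JSON string
    connection = pyodbc.connect('Driver={SQL Server};'
                                'Server=IN2367403W1\\SQLEXPRESS;'
                                'Database=TestDB;'
                                'Trusted_Connection=yes;')
    cursor = connection.cursor()
    # hypothetical table: dbo.RawJson(payload NVARCHAR(MAX))
    cursor.execute("INSERT INTO TestDB.dbo.RawJson (payload) VALUES (?)",
                   json.dumps(payload))
    connection.commit()
    connection.close()

insert_json_blob({"Name": "Alice", "City": "Pune", "Age": 30, "anything_else": "works too"})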

python pypyodbc won't select data

I think I'm going mad here... again :). I'm trying to do the simplest thing on the planet and it doesn't work for some reason unknown to me. I have a Python script that connects to an MSSQL database using pypyodbc and does stuff. When I insert data into the database, it works. When I try to extract it, it fails miserably. What am I doing wrong?
import pypyodbc as mssql

msConnErr = None
try:
    msconn = mssql.connect('DRIVER={SQL Server};SERVER=server_name;DATABASE=database;TRUSTED_CONNECTION=True')
    print('Source server connected')
    srcCursor = msconn.cursor()
except:
    print('Source server error')
    msConnErr = True

srcCursor.execute("SELECT * FROM schema.table")
srcResult = srcCursor.fetchall()
print(srcResult)
The connection works, as I'm given a success message. Using SQL Server Management Studio I can also see that my script is connected to the correct database, so I know I'm working in the right environment. The error I'm getting is:
UndefinedTable: relation "schema.table" does not exist
LINE 1: SELECT * FROM schema.table
The table exists, and I must specify the schema because I have the same table name in different schemas (data lifecycle). I can extract data from it using SQL Server Management Studio, yet Python fails miserably, even though it has no problem inserting 35 million rows into it using the same driver. No other query works either; even SELECT @@VERSION fails, SELECT TOP (10) * FROM schema.table fails, etc.
Any ideas?
Basically, I had a piece of code that would overwrite the srcCursor variable with a cursor from another connection, and obviously that relation wouldn't be present on the other server. Apologies!
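For anyone hitting something similar, the fix boils down to keeping one clearly named connection/cursor pair per server so a later connection can't silently shadow the first. A minimal sketch (server and database names are placeholders):
import pypyodbc

# source server: its cursor is never reassigned
src_conn = pypyodbc.connect('DRIVER={SQL Server};SERVER=source_server;DATABASE=source_db;TRUSTED_CONNECTION=True')
src_cursor = src_conn.cursor()

# target server gets its own, separately named cursor
dst_conn = pypyodbc.connect('DRIVER={SQL Server};SERVER=target_server;DATABASE=target_db;TRUSTED_CONNECTION=True')
dst_cursor = dst_conn.cursor()

# this SELECT now unambiguously runs against the source server
src_cursor.execute("SELECT TOP (10) * FROM schema.table")
print(src_cursor.fetchall())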

insert records do not show up in postgres

I am using Python to perform basic ETL, transferring records from a MySQL database to a Postgres database. Here is the Python code that commences the transfer:
source_cursor = source_cnx.cursor()
source_cursor.execute(query.extract_query)
data = source_cursor.fetchall()
source_cursor.close()

# load data into warehouse db
if data:
    target_cursor = target_cnx.cursor()
    #target_cursor.execute("USE {};".format(datawarehouse_name))
    target_cursor.executemany(query.load_query, data)
    print('data loaded to warehouse db')
    target_cursor.close()
else:
    print('data is empty')
MySQL Extract (extract_query):
SELECT `tbl_rrc`.`id`,
`tbl_rrc`.`col_filing_operator`,
`tbl_rrc`.`col_medium`,
`tbl_rrc`.`col_district`,
`tbl_rrc`.`col_type`,
DATE_FORMAT(`tbl_rrc`.`col_timestamp`, '%Y-%m-%d %T.%f') as `col_timestamp`
from `tbl_rrc`
PostgreSQL (loading_query)
INSERT INTO geo_data_staging.tbl_rrc
(id,
col_filing_operator,
col_medium,
col_district,
col_type,
col_timestamp)
VALUES
(%s, %s, %s, %s, %s, %s);
Of note, there is a PK constraint on Id.
The problem is that while I get no errors, I'm not seeing any of the records in the target table. I tested this by manually inserting a record and then running the script again; the code errored out violating the PK constraint, so I know it's finding the table.
Any idea what I could be missing? I would greatly appreciate it.
Using psycopg2, you have to call commit() on the connection for the transaction to be committed. If you just call close(), the transaction is implicitly rolled back.
There are a couple of exceptions to this. You can set the connection to autocommit, or you can use the connection as a context manager (a with block), which commits automatically if the block doesn't raise an exception.
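Applied to the snippet from the question, that looks roughly like this (a sketch; target_cnx, query.load_query and data are the objects from the question):
# load data into warehouse db
if data:
    target_cursor = target_cnx.cursor()
    target_cursor.executemany(query.load_query, data)
    target_cnx.commit()  # commit on the connection, not the cursor
    print('data loaded to warehouse db')
    target_cursor.close()
else:
    print('data is empty')

# or let the connection context manager commit for you:
# with target_cnx:
#     with target_cnx.cursor() as target_cursor:
#         target_cursor.executemany(query.load_query, data)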

Update MSSQL table through SQLAlchemy using dataframes

I'm trying to replace some old MSSQL stored procedures with python, in an attempt to take some of the heavy calculations off of the sql server. The part of the procedure I'm having issues replacing is as follows
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've run some queries with SQLAlchemy, stored the results in DataFrames, and done the requisite calculations on them. What I don't know is how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The DataFrame I have on my end would essentially have to take the place of the temporary table created in MSSQL Server, but I'm not sure how to do that.
The difficulty, of course, is that I don't know of a way to join a DataFrame against an MSSQL table, and I'm guessing that wouldn't work anyway, so I'm looking for a workaround.
As the pandas doc suggests here :
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; check it here.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, then recreates it and inserts all the data at once.
Setting the echo attribute to True shows all background logs and the SQL being emitted.
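To get the behaviour of the original stored procedure (update in place rather than replace), one workaround is to push the DataFrame into a staging table with to_sql and then run the UPDATE ... JOIN server-side. A rough sketch - my_temp_table, the join columns and the filter values are taken from the question, everything else is assumed:
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

# 1) dump the calculated values from the DataFrame into a staging table
dataframe.to_sql('my_temp_table', engine, schema='dbo', if_exists='replace', index=False)

# 2) apply them to the real table with the same join the old procedure used
update_sql = text("""
    UPDATE mytable
    SET calc_value = tmp.calc_value
    FROM dbo.mytable mytable
    INNER JOIN dbo.my_temp_table tmp
        ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
    WHERE mytable.a = :a_value
      AND mytable.x = tmp.x
      AND mytable.b = :b_value
""")

# some_value / some_other_value stand in for the literals in the original WHERE clause
with engine.begin() as conn:  # begin() commits on success, rolls back on error
    conn.execute(update_sql, {"a_value": some_value, "b_value": some_other_value})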

Iterate loop in Python through multiple databases in SQL

I currently have a remote SQL server with multiple database structures on it. I'm connecting through Python code using the PyMSSQL plugin and extracting data into pandas before applying some analysis. Is there a way to iterate such that with each loop the database number changes, allowing a new database's data to be analysed?
E.g.
*connect to server
cursor.execute("SELECT TOP 100 *variable name* FROM *database_1*")
*analyse
*disconnect server
Ideally I would have a loop allowing me to automatically read data from, say, database_1 through to database_10.
IIUC you can easily do this using read_sql() method:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('mssql+pymssql://USER:PWD@hostname/db_name')
for i in range(1, 11):  # database_1 through database_10
    # qualify the table with the database name (table_name is a placeholder)
    qry = 'SELECT TOP 100 variable_name FROM database_{}.dbo.table_name'.format(i)
    df = pd.read_sql(qry, engine)
    # analyse ...
