Fetch from one database, Insert/Update into another using SQLAlchemy - python

We have data in a Snowflake cloud database that we would like to move into an Oracle database. As we would like to work toward refreshing the Oracle database regularly, I am trying to use SQLAlchemy to automate this.
I would like to do this using Core because my team is all experienced with SQL, but I am the only one with Python experience. I think it would be easier to tweak the data pulls if we just pass SQL strings. Plus, the Snowflake db has some columns containing JSON, which seems easier to parse using direct SQL since I do not see JSON support in the SnowflakeDialect.
I have established connections to both databases and am able to do select queries from both. I have also manually created the tables in our Oracle db so that the keys and datatypes match what I am pulling from Snowflake. When I try to insert, though, my Jupyter notebook just continuously says "Executing Cell" and hangs. Any thoughts on how to proceed or how to get the notebook to tell me where the hangup is?
from sqlalchemy import create_engine, pool, MetaData, text
from snowflake.sqlalchemy import URL
import pandas as pd

eng_sf = create_engine(URL(  # engine for Snowflake
    account='account',
    user='user',
    password='password',
    database='database',
    schema='schema',
    warehouse='warehouse',
    role='role',
    timezone='timezone',
))
eng_o = create_engine("oracle+cx_oracle://{}[{}]:{}@{}".format('user', 'proxy', 'password', 'database'),
                      poolclass=pool.NullPool)  # engine for Oracle

meta_o = MetaData()
meta_o.reflect(bind=eng_o)
person_o = meta_o.tables['bb_lms_person']  # other Oracle tables follow this example

meta_sf = MetaData()
meta_sf.reflect(bind=eng_sf, only=['person'])  # other Snowflake tables as well, but for simplicity, let's look at one
person_sf = meta_sf.tables['person']
person_query = """
SELECT ID
,EMAIL
,STAGE:student_id::STRING as STUDENT_ID
,ROW_INSERTED_TIME
,ROW_UPDATED_TIME
,ROW_DELETED_TIME
FROM cdm_lms.PERSON
"""
with eng_sf.begin() as connection:
    result = connection.execute(text(person_query)).fetchall()  # this snippet runs and returns result as expected

with eng_o.begin() as connection:
    connection.execute(person_o.insert(), result)  # this is a coin flip: sometimes it runs, sometimes it just hangs forever

eng_sf.dispose()
eng_o.dispose()
I've checked the typical offenders. The keys for both person_o and the result are all lowercase and match. Any guidance would be appreciated.
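One way to at least narrow down where the hang occurs is to turn on SQLAlchemy's engine logging and to push the rows across in smaller batches of plain dicts. This is only a diagnostic sketch under those assumptions (the 1,000-row batch size is arbitrary), not a confirmed fix:
import logging

# Log every statement SQLAlchemy emits, so the notebook output shows the last
# statement issued before the hang (an alternative to echo=True on create_engine)
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Convert the fetched rows to plain dicts; on SQLAlchemy 1.4+ use dict(row._mapping) instead
rows = [dict(row) for row in result]

with eng_o.begin() as connection:
    for start in range(0, len(rows), 1000):  # arbitrary batch size, purely for illustration
        batch = rows[start:start + 1000]
        connection.execute(person_o.insert(), batch)
        print("inserted rows {} to {}".format(start, start + len(batch)))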

Use the reflected metadata for the table. The fTable_Stage update or insert is built as a fluent chain, and values are assigned through keyword arguments that must correspond to the table's metadata columns, which keeps the statement safe. In this example I am updating three fields: LateProbabilityDNN, Sentiment_Polarity, and Sentiment_Subjectivity.
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import sessionmaker

# params, keyID, late_proba, and my_valance are defined elsewhere in the original script
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
connection = engine.connect()
metadata = MetaData()
Session = sessionmaker(bind=engine)
session = Session()

# Reflect the existing table so its columns are available as fTable_Stage.c.<name>
fTable_Stage = Table('fTable_Stage', metadata, autoload=True, autoload_with=engine)

stmt = fTable_Stage.update().where(fTable_Stage.c.KeyID == keyID).values(
    LateProbabilityDNN=round(float(late_proba), 2),
    Sentiment_Polarity=round(my_valance.sentiment.polarity, 2),
    Sentiment_Subjectivity=round(my_valance.sentiment.subjectivity, 2),
)
connection.execute(stmt)
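For completeness, the insert counterpart of the same fluent pattern looks almost identical; this is just an illustrative sketch that reuses the column names above with placeholder values:
# Illustrative only: placeholder values on the same reflected table
stmt = fTable_Stage.insert().values(
    KeyID=keyID,
    LateProbabilityDNN=0.75,
    Sentiment_Polarity=0.1,
    Sentiment_Subjectivity=0.4,
)
connection.execute(stmt)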

Related

Project involving an API, a database and a visualisation package

Firstly, this is my first post on Stack Overflow, so if I haven't structured it properly, please let me know. Basically, I'm new to Python, but I've been trying to connect an API to Python, from Python to a database that is hosted online, and then finally into a visualisation package. I'm running into some problems when inserting the API data (Sheffield Solar) from Python into my database. The data does actually upload to the database, but I'm struggling with an error message that I get in Python.
from datetime import datetime, date
import pytz
import psycopg2
import sqlalchemy
from pandas import DataFrame
from pvlive_api import PVLive
from sqlalchemy import create_engine, Integer, String, DATETIME, FLOAT


def insert_data():
    """ Connect to the PostgreSQL database server """
    # Calling the class from the pvlive_api.py file
    data = PVLive()
    # Gets the data between the two dates from the API and converts the output into a dataframe
    dl = data.between(datetime(2019, 4, 5, 10, 30, tzinfo=pytz.utc),
                      datetime(2020, 4, 5, 14, 0, tzinfo=pytz.utc), entity_type="pes",
                      entity_id=0, dataframe=True)
    # sql is used to insert the API data into the database table
    sql = """INSERT INTO sheffield (pes_id, datetime_gmt, generation_mw) VALUES (%s, %s, %s)"""
    uri = "Redacted"
    print('Connecting to the PostgreSQL database...')
    engine = create_engine('postgresql+psycopg2://Redacted')
    # connect to the PostgreSQL server
    conn = psycopg2.connect(uri)
    # create a cursor that allows python code to execute Postgresql commands
    cur = conn.cursor()
    # Converts the data from a dataframe to an sql readable format, it also appends new data to the table, also
    # prevents the index from being included in the table
    into_db = dl.to_sql('sheffield', engine, if_exists='append', index=False)
    cur.execute(sql, into_db)
    # Commits any changes to ensure they actually happen
    conn.commit()
    # close the communication with the PostgreSQL
    cur.close()


def main():
    insert_data()


if __name__ == "__main__":
    main()
The error I'm getting is as follows:
psycopg2.errors.SyntaxError: syntax error at or near "%"
LINE 1: ...eld (pes_id, datetime_gmt, generation_mw) VALUES (%s, %s, %s...
with the ^ pointing at the first %s. I'm assuming the issue is due to me using into_db as the second argument in cur.execute(); however, as I mentioned earlier, the data still uploads into my database. I'm very new to Python, so it could be an easily solvable issue that I've overlooked. I've also redacted some personal connection information from the code. Any help would be appreciated, thanks.
You are getting this error because you are trying to execute an INSERT query without supplying any values for it.
If you read the documentation for dl.to_sql before using it, you will see that this method already writes the records to the database and returns None.
So there is no need to construct your own SQL query for inserting the data; drop the cur.execute(sql, into_db) call and let to_sql do the insert.
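Concretely, a trimmed-down version of the asker's function with the redundant cursor work removed might look like this (the connection string stays redacted, as in the question):
from datetime import datetime
import pytz
from pvlive_api import PVLive
from sqlalchemy import create_engine


def insert_data():
    """Pull PV data from the Sheffield Solar API and append it to the sheffield table."""
    data = PVLive()
    dl = data.between(datetime(2019, 4, 5, 10, 30, tzinfo=pytz.utc),
                      datetime(2020, 4, 5, 14, 0, tzinfo=pytz.utc),
                      entity_type="pes", entity_id=0, dataframe=True)
    engine = create_engine('postgresql+psycopg2://Redacted')
    # to_sql performs the INSERTs itself; no manual cursor or INSERT statement is needed
    dl.to_sql('sheffield', engine, if_exists='append', index=False)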

How to query a (Postgres) RDS DB through an AWS Jupyter Notebook?

I'm trying to query an RDS (Postgres) database through Python, more specifically a Jupyter Notebook. Overall, what I've been trying for now is:
import boto3

client = boto3.client('rds-data')
response = client.execute_sql(
    awsSecretStoreArn='string',
    database='string',
    dbClusterOrInstanceArn='string',
    schema='string',
    sqlStatements='string'
)
The error I've been receiving is:
BadRequestException: An error occurred (BadRequestException) when calling the ExecuteSql operation: ERROR: invalid cluster id: arn:aws:rds:us-east-1:839600708595:db:zprime
In the end, it was much simpler than I thought, nothing fancy or specific. It was basically a solution I had used before when accessing one of my local DBs. Simply import a specific library for your database type (Postgres, MySQL, etc) and then connect to it in order to execute queries through python.
I don't know if it will be the best solution since making queries through python will probably be much slower than doing them directly, but it's what works for now.
import psycopg2

conn = psycopg2.connect(database='database_name',
                        user='user',
                        password='password',
                        host='host',
                        port='port')
cur = conn.cursor()
cur.execute('''
    SELECT *
    FROM table;
''')
cur.fetchall()
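Since this runs in a Jupyter notebook, it can also be convenient to pull the results straight into a DataFrame; a minimal sketch, assuming pandas is installed and reusing the same placeholder connection details:
import pandas as pd
import psycopg2

# Placeholder connection details, as in the snippet above
conn = psycopg2.connect(database='database_name', user='user',
                        password='password', host='host', port='port')

# read_sql_query runs the query and returns the rows as a DataFrame,
# which renders nicely in a notebook cell
df = pd.read_sql_query('SELECT * FROM table;', conn)
df.head()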

How to execute pure SQL query in Python

I am trying to just create a temporary table in my SQL database, where I then want to insert data (from a Pandas DataFrame), and via this temporary table insert the data into a 'permanent' table within the database.
So far I have something like
""" Database specific... """
import sqlalchemy
from sqlalchemy.sql import text
dsn = 'dsn-sql-acc'
database = "MY_DATABASE"
connection_str = """
Driver={SQL Server Native Client 11.0};
Server=%s;
Database=%s;
Trusted_Connection=yes;
""" % (dsn,database)
connection_str_url = urllib.quote_plus(connection_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % connection_str_url, encoding='utf8', echo=True)
# Open connection
db_connection = engine.connect()
sql_create_table = text("""
IF OBJECT_ID('[MY_DATABASE].[SCHEMA_1].[TEMP_TABLE]', 'U') IS NOT NULL
DROP TABLE [MY_DATABASE].[SCHEMA_1].[TEMP_TABLE];
CREATE TABLE [MY_DATABASE].[SCHEMA_1].[TEMP_TABLE] (
[Date] Date,
[TYPE_ID] nvarchar(50),
[VALUE] nvarchar(50)
);
""")
db_connection.execute("commit")
db_connection.execute(sql_create_table)
db_connection.close()
The "raw" SQL-snippet within sql_create_table works fine when executed in SQL Server, but when running the above in Python, nothing happens in my database...
What seems to be the issue here?
Later on I would of course want to execute
BULK INSERT [MY_DATABASE].[SCHEMA_1].[TEMP_TABLE]
FROM '//temp_files/temp_file_data.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR='\n');
in Python as well...
Thanks
These statements are out of order:
db_connection.execute("commit")
db_connection.execute(sql_create_table)
Commit after creating your table and your table will persist.
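In other words, the tail of the script should issue the DDL first and commit afterwards; a minimal sketch of the corrected order:
# Run the CREATE TABLE statement first, then commit so the new table persists
db_connection.execute(sql_create_table)
db_connection.execute("commit")
db_connection.close()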

How to query a MySQL database via python using peewee/mysqldb?

I'm creating an iOS client for App.net and I'm attempting to setup a push notification server. Currently my app can add a user's App.net account id (a string of numbers) and a APNS device token to a MySQL database on my server. It can also remove this data. I've adapted code from these two tutorials:
How To Write A Simple PHP/MySQL Web Service for an iOS App - raywenderlich.com
Apple Push Notification Services in iOS 6 Tutorial: Part 1/2 - raywenderlich.com
In addition, I've adapted this awesome python script to listen in to App.net's App Stream API.
My Python is horrendous, as is my MySQL knowledge. What I'm trying to do is access the APNS device token for the accounts I need to notify. My database table has two fields/columns for each entry, one for user_id and one for device_token. I'm not sure of the terminology, so please let me know if I can clarify this.
I've been trying to use peewee to read from the database but I'm in way over my head. This is a test script with placeholder user_id:
import logging
from pprint import pprint
import peewee
from peewee import *

db = peewee.MySQLDatabase("...", host="localhost", user="...", passwd="...")


class MySQLModel(peewee.Model):
    class Meta:
        database = db


class Active_Users(MySQLModel):
    user_id = peewee.CharField(primary_key=True)
    device_token = peewee.CharField()


db.connect()

# This is the placeholder user_id
userID = '1234'

token = Active_Users.select().where(Active_Users.user_id == userID)
pprint(token)
This then prints out:
<class '__main__.User'> SELECT t1.`id`, t1.`user_id`, t1.`device_token` FROM `user` AS t1 WHERE (t1.`user_id` = %s) [u'1234']
If the code didn't make it clear, I'm trying to query the database for the row with the user_id of '1234' and I want to store the device_token of the same row (again, probably the wrong terminology) into a variable that I can use when I send the push notification later on in the script.
How do I correctly return the device_token? Also, would it be easier to forgo peewee and simply query the database using python-mysqldb? If that is the case, how would I go about doing that?
The call Active_Users.select().where(Active_Users.user_id == userID) returns a query over the matching rows, not the device_token itself, but you are assigning it to a variable called token as though it were just the device_token.
Your assignment should be this:
matching_users = Active_Users.select().where(Active_Users.user_id == userID)  # a query over the matching users, even if there is just one
if matching_users.exists():
    token = matching_users[0].device_token
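Since user_id is the primary key, another common peewee pattern is to fetch the single matching row directly; a small sketch, assuming the same Active_Users model as above:
# .first() returns the matching model instance or None instead of a query object
user = Active_Users.select().where(Active_Users.user_id == userID).first()
if user is not None:
    token = user.device_token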

Create a table from query results in Google BigQuery

We're using Google BigQuery via the Python API. How would I create a table (new one or overwrite old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELEC ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like, if you were using the raw API. If you're using Python, the Python client should give accessors to these same fields:
"configuration": {
"query": {
"query": "select count(*) from foo.bar",
"destinationTable": {
"projectId": "my_project",
"datasetId": "my_dataset",
"tableId": "my_table"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
}
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
from google.cloud import bigquery


class Client(object):
    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTs the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
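Using the class is then just a couple of lines; the project, dataset, and table names below are placeholders:
# Placeholder names; substitute your own project, datasets, and tables
client = Client(origin_project='my_project',
                origin_dataset='my_dataset',
                origin_table='my_table',
                destination_dataset='my_dest_dataset',
                destination_table='my_dest_table')
client.run()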
Create a table from query results in Google BigQuery: assuming you are using a Jupyter Notebook with Python 3, the following steps are explained:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results as a table in that new dataset on BQ
Create a new dataset on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API
dataset.location = 'US'  # Geographic location of the new dataset; it must match the location of the source dataset being queried
# Send the dataset to the API for creation. Raises google.api_core.exceptions.AlreadyExists
# if the dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are two variants here:
Allowing large results
Querying without allowing large results
I am using the public dataset bigquery-public-data:hacker_news and its comments table to run the queries.
Allowing Large Results
DestinationTableName='table_id1' #Enter new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without --allow_large_results:
DestinationTableName='table_id2' #Enter the new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This works as long as the query result does not exceed the response-size limit mentioned in the Google BQ documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These are bq commands that you can run directly in a terminal (without the leading !). Because we are running them from a Python/Jupyter environment, the ! prefix is used; it lets shell commands be run from within the Python program as well.
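If you would rather stay in pure Python instead of shelling out to bq, roughly the same thing can be expressed with the google-cloud-bigquery client; a sketch under that assumption, with project_id, my_dataset, and table_id1 as placeholders:
from google.cloud import bigquery

client = bigquery.Client()
destination = client.dataset('my_dataset').table('table_id1')  # placeholder names

job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True        # the [project:dataset.table] syntax is legacy SQL
job_config.allow_large_results = True   # with legacy SQL this requires a destination table
job_config.destination = destination

job = client.query('SELECT * FROM [bigquery-public-data:hacker_news.comments]',
                   job_config=job_config)
job.result()  # wait for the job to finish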
