I'm trying to load data that I have in a pandas DataFrame into a Redshift cluster using AWS Lambda. I can't use a connector with the Redshift endpoint URL because the current VPC setup doesn't allow this connection.
What I can do is use the Redshift Data API, which works like this:
import time

import boto3

redshift_data_api_client = boto3.client('redshift-data')
redshift_database = 'my_redshift_db'
DbUser = 'my_user'
ClusterIdentifier = 'my_redshift_cluster'

query = '''
select *
from some_schema.some_table t
limit 10
'''

# Submit the query; the call returns immediately with a statement id.
res = redshift_data_api_client.execute_statement(
    Database=redshift_database, DbUser=DbUser, Sql=query,
    ClusterIdentifier=ClusterIdentifier)

# Give the statement time to finish before fetching the result.
time.sleep(10)
query_id = res["Id"]
response = redshift_data_api_client.get_statement_result(
    Id=query_id)
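As an aside, rather than a fixed time.sleep(10), the statement status can be polled until it finishes. A minimal sketch, assuming describe_statement returns a Status field as documented:

import time

def wait_for_statement(client, statement_id, poll_seconds=2):
    # Poll the Data API until the statement reaches a terminal state.
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(poll_seconds)

Once the returned status is FINISHED, get_statement_result can be called as above.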
The problem is that I haven't been able to integrate the Redshift Data API with a pandas dataframe. Ideally, I would like to be able to do something like:
redshift_data_api_client.insert_from_pandas(table, my_dataframe)
If that's not an option, I'd like to generate the INSERT SQL statement as a string from the data frame, so I could do:
insert_query = my_dataframe.get_insert_sql_statement()
res = redshift_data_api_client.execute_statement(
    Database=redshift_database,
    DbUser=DbUser,
    Sql=insert_query,
    ClusterIdentifier=ClusterIdentifier)
But I couldn't find a way to do that either. Pandas has a to_sql function, but it sends the data directly to a db connection (which I don't have); it doesn't generate the INSERT statement as a string.
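Roughly, this is the kind of helper I have in mind. It is only a sketch: every value is quoted as a string, the quoting/escaping is naive, and the table name is just the one from the example above:

import pandas as pd

def dataframe_to_insert_sql(df, table_name):
    # Build a multi-row INSERT statement from the DataFrame contents.
    # NOTE: naive quoting; not safe for untrusted data.
    columns = ", ".join(df.columns)
    rows = []
    for record in df.itertuples(index=False):
        values = ", ".join(
            "NULL" if pd.isna(v) else "'{}'".format(str(v).replace("'", "''"))
            for v in record
        )
        rows.append("({})".format(values))
    return "INSERT INTO {} ({}) VALUES {}".format(table_name, columns, ", ".join(rows))

insert_query = dataframe_to_insert_sql(my_dataframe, 'some_schema.some_table')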
Any help would be greatly appreciated :)
I am receiving a daily data drop into my GCS bucket and have a Cloud Function that moves the CSV data to a BigQuery table (see code below).
import datetime

def load_table_uri_csv(table_id):
    # [START bigquery_load_table_gcs_csv]
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "dataSet.dataTable"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
    # [END bigquery_load_table_gcs_csv]
However, the data comes with a 2-day look-back, resulting in repeated data in the BigQuery table.
Is there a way for me to update this cloud function to only pull in the most recent date from the csv once it is dropped off? This way I can easily avoid duplicative data within the reporting.
Or maybe there's a way for me to run a scheduled query via BigQuery to resolve this?
For reference, the date column within the CSV comes in a TIMESTAMP schema.
Any and all help is appreciated!
Unfortunately, there seems to be no way to do this directly in Google Cloud Platform. You will need to filter your information somehow before loading it.
You could review the information from the CSV in your code or through another medium.
It's also possible to submit a feature request for Google to consider this functionality.
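As one illustration of filtering in your code before loading, here is a minimal sketch that keeps only the rows for the most recent date in the CSV and loads just those into BigQuery. The column name date, the URI, and the table id are placeholders, and reading gs:// paths with pandas assumes gcsfs is installed.

import pandas as pd
from google.cloud import bigquery

def load_latest_rows(uri, table_id):
    # Read the daily drop and keep only rows for the most recent date.
    # Reading gs:// paths with pandas requires gcsfs.
    df = pd.read_csv(uri, parse_dates=["date"])  # "date" is a placeholder column name
    latest = df["date"].dt.date.max()
    df = df[df["date"].dt.date == latest]

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()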
While fetching around 890,000 rows, each with 60-65 columns, from a MySQL database using pd.read_sql(), the connection (created by a SQLAlchemy engine) is dropped before the query completes. Is there any other way to optimize fetching this amount of data? I do need all the rows and columns, and I would like to get rid of the exception.
Here is a code snippet:
import pandas as pd

def read_outputs(engine):
    data = dict()
    with engine.connect() as conn:
        data['tbl_1'] = pd.read_sql('tbl_1', con=conn).to_json()
        data['tbl_2'] = pd.read_sql('tbl_2', con=conn).to_json()
        data['tbl_3'] = pd.read_sql('tbl_3', con=conn).to_json()
    engine.dispose()
    return {'data': data}
Increase the default timeout by setting the connect_timeout property:
conn.query('SET GLOBAL connect_timeout=<desired time>')
You can also set the timeout when creating the engine as:
create_engine(db_url, connect_args={'connect_timeout': <desired time>})
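For tables of this size it can also help to combine the longer timeout with chunked reads, so the rows are streamed instead of materialised in one go. This is only a sketch, assuming a pymysql driver; the URL, timeout, and chunk size are placeholders:

import pandas as pd
from sqlalchemy import create_engine

db_url = 'mysql+pymysql://user:password@host/dbname'  # placeholder
engine = create_engine(db_url, connect_args={'connect_timeout': 600})

def read_table_in_chunks(table_name, chunksize=50000):
    # Stream the table in chunks and stitch the pieces back together.
    with engine.connect() as conn:
        chunks = pd.read_sql(table_name, con=conn, chunksize=chunksize)
        return pd.concat(chunks, ignore_index=True)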
Firstly, this is my first post on Stack Overflow, so if I haven't structured my post properly, please let me know. Basically, I'm new to Python, but I've been trying to connect an API to Python, push the data from Python into a database that is hosted online, and then finally feed it into a visualization package. I'm running into some problems when inserting the API data (Sheffield Solar) from Python into my database. The data does actually upload to the database, but I'm struggling with an error message that I get in Python.
from datetime import datetime, date
import pytz
import psycopg2
import sqlalchemy
from pandas import DataFrame
from pvlive_api import PVLive
from sqlalchemy import create_engine, Integer, String, DATETIME, FLOAT
def insert_data():
    """ Connect to the PostgreSQL database server """
    # Calling the class from the pvlive_api.py file
    data = PVLive()
    # Gets the data between the two dates from the API and converts the output into a dataframe
    dl = data.between(datetime(2019, 4, 5, 10, 30, tzinfo=pytz.utc),
                      datetime(2020, 4, 5, 14, 0, tzinfo=pytz.utc), entity_type="pes",
                      entity_id=0, dataframe=True)
    # sql is used to insert the API data into the database table
    sql = """INSERT INTO sheffield (pes_id, datetime_gmt, generation_mw) VALUES (%s, %s, %s)"""
    uri = "Redacted"
    print('Connecting to the PostgreSQL database...')
    engine = create_engine(
        'postgresql+psycopg2://Redacted')
    # connect to the PostgreSQL server
    conn = psycopg2.connect(uri)
    # create a cursor that allows python code to execute Postgresql commands
    cur = conn.cursor()
    # Converts the data from a dataframe to an sql readable format, it also appends new data to the table, also
    # prevents the index from being included in the table
    into_db = dl.to_sql('sheffield', engine, if_exists='append', index=False)
    cur.execute(sql, into_db)
    # Commits any changes to ensure they actually happen
    conn.commit()
    # close the communication with the PostgreSQL
    cur.close()

def main():
    insert_data()

if __name__ == "__main__":
    main()
The error I'm getting is as follows:
psycopg2.errors.SyntaxError: syntax error at or near "%"
LINE 1: ...eld (pes_id, datetime_gmt, generation_mw) VALUES (%s, %s, %s...
with the ^ pointing at the first %s. I'm assuming that the issue is due to me using into_db as the second argument in cur.execute(); however, as I mentioned earlier, the data still uploads into my database. I'm very new to Python, so it could be an easily solvable issue that I've overlooked. I've also redacted some personal connection information from the code. Any help would be appreciated, thanks.
You are getting this error because you are trying to execute the query without any values to insert.
If you read the documentation for to_sql before using it, you will see that this method writes the records to the database itself and returns None.
So there is no need to construct your own SQL INSERT query; the to_sql call has already inserted the data.
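In other words, the function can be reduced to the to_sql call alone. A minimal sketch of that, reusing the (redacted) engine URL and the same API call from the question:

from datetime import datetime

import pytz
from pvlive_api import PVLive
from sqlalchemy import create_engine

def insert_data():
    data = PVLive()
    dl = data.between(datetime(2019, 4, 5, 10, 30, tzinfo=pytz.utc),
                      datetime(2020, 4, 5, 14, 0, tzinfo=pytz.utc),
                      entity_type="pes", entity_id=0, dataframe=True)
    engine = create_engine('postgresql+psycopg2://Redacted')
    # to_sql performs the INSERTs itself; no cursor or manual SQL is needed.
    dl.to_sql('sheffield', engine, if_exists='append', index=False)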
We have data in a Snowflake cloud database that we would like to move into an Oracle database. As we would like to work toward refreshing the Oracle database regularly, I am trying to use SQLAlchemy to automate this.
I would like to do this using Core because my team is all experienced with SQL, but I am the only one with Python experience. I think it would be easier to tweak the data pulls if we just pass SQL strings. Plus the Snowflake db has some columns with JSON that seems easier to parse using direct SQL since I do not see JSON in the SnowflakeDialect.
I have established connections to both databases and am able to do select queries from both. I have also manually created the tables in our Oracle db so that the keys and datatypes match what I am pulling from Snowflake. When I try to insert, though, my Jupyter notebook just continuously says "Executing Cell" and hangs. Any thoughts on how to proceed or how to get the notebook to tell me where the hangup is?
from sqlalchemy import create_engine, pool, MetaData, text
from snowflake.sqlalchemy import URL
import pandas as pd

eng_sf = create_engine(URL(  # engine for snowflake
    account='account',
    user='user',
    password='password',
    database='database',
    schema='schema',
    warehouse='warehouse',
    role='role',
    timezone='timezone',
))

eng_o = create_engine("oracle+cx_oracle://{}[{}]:{}@{}".format('user', 'proxy', 'password', 'database'),
                      poolclass=pool.NullPool)  # engine for oracle

meta_o = MetaData()
meta_o.reflect(bind=eng_o)
person_o = meta_o.tables['bb_lms_person']  # other oracle tables follow this example

meta_sf = MetaData()
meta_sf.reflect(bind=eng_sf, only=['person'])  # other snowflake tables as well, but for simplicity, let's look at one
person_sf = meta_sf.tables['person']

person_query = """
SELECT ID
    ,EMAIL
    ,STAGE:student_id::STRING as STUDENT_ID
    ,ROW_INSERTED_TIME
    ,ROW_UPDATED_TIME
    ,ROW_DELETED_TIME
FROM cdm_lms.PERSON
"""

with eng_sf.begin() as connection:
    result = connection.execute(text(person_query)).fetchall()  # this snippet runs and returns result as expected

with eng_o.begin() as connection:
    connection.execute(person_o.insert(), result)  # this is a coin flip: sometimes it runs, sometimes it just hangs forever

eng_sf.dispose()
eng_o.dispose()
I've checked the typical offenders. The keys for both person_o and the result are all lowercase and match. Any guidance would be appreciated.
Use the metadata for the table. Build the update (or insert) from the reflected fTable_Stage Table object with its fluent methods and pass the new values as keyword arguments to .values(). This is quite safe, because only columns defined in the metadata can be referenced. Here I am updating three fields: LateProbabilityDNN, Sentiment_Polarity and Sentiment_Subjectivity:
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
connection=engine.connect()
metadata=MetaData()
Session = sessionmaker(bind = engine)
session = Session()
fTable_Stage=Table('fTable_Stage', metadata,autoload=True,autoload_with=engine)
stmt=fTable_Stage.update().where(fTable_Stage.c.KeyID==keyID).values(\
LateProbabilityDNN=round(float(late_proba),2),\
Sentiment_Polarity=round(my_valance.sentiment.polarity,2),\
Sentiment_Subjectivity= round(my_valance.sentiment.subjectivity,2)\
)
connection.execute(stmt)
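For completeness, an insert follows the same fluent pattern; a minimal sketch with placeholder values:

stmt = fTable_Stage.insert().values(
    LateProbabilityDNN=0.42,       # placeholder value
    Sentiment_Polarity=0.1,        # placeholder value
    Sentiment_Subjectivity=0.5,    # placeholder value; any other required columns need values too
)
connection.execute(stmt)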
We're using Google BigQuery via the Python API. How would I create a table (new one or overwrite old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELEC ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like, if you were using the raw API. If you're using Python, the Python client should give accessors to these same fields:
"configuration": {
"query": {
"query": "select count(*) from foo.bar",
"destinationTable": {
"projectId": "my_project",
"datasetId": "my_dataset",
"tableId": "my_table"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
}
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
import time

from google.cloud import bigquery
from google.cloud.bigquery.table import Table
from google.cloud.bigquery.dataset import Dataset


class Client(object):

    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTS the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
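A hypothetical usage of this class might look like the following; all the project, dataset and table names here are placeholders:

client = Client(
    origin_project="my_project",
    origin_dataset="my_dataset",
    origin_table="my_table",
    destination_dataset="my_dataset",
    destination_table="my_table_copy",
)
client.run()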
Create a table from query results in Google BigQuery. Assuming you are using a Jupyter Notebook with Python 3, I am going to explain the following steps:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results in a new dataset in table format on BQ
Create a new DataSet on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.
# Specify the geographic location where the new dataset will reside. Remember this should be the same
# location as the source dataset from which we are getting the data to run the query.
dataset.location = 'US'

# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.AlreadyExists if the Dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are 2 types here:
Allowing Large Results
Query without mentioning large result etc.
I am using the public dataset bigquery-public-data:hacker_news and table id comments to run a query.
Allowing Large Results
DestinationTableName='table_id1' #Enter new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without mentioning --allow_large_results:
DestinationTableName='table_id2' #Enter new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This will work for the query where the result is not going to cross the limit mentioned in Google BQ documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These are commands which you can also run in the terminal (without the ! at the beginning). Because we are running them from a Python notebook, we prefix them with !, which lets shell commands run from within the notebook.