It seems like query caching in the Python BigQuery library does not work: the example below always prints None. How can I fix this?
from google.cloud import bigquery

bq_client = bigquery.Client()

query = """
SELECT *
FROM (SELECT 1)
"""
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = True
results = bq_client.query(query, job_config=job_config)
print(results.cache_hit)  # always prints None
It turns out you need to do something with the results object first, for example results.to_dataframe().
Thanks to VictorGGI and William Funks: cache_hit returns None if the job is not yet complete. You would need to call done() to verify the job has completed, or use any other method that waits for completion.
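For a minimal sketch of the fix (reusing the bq_client, query, and job_config from the question), wait for the job to complete before reading cache_hit:
results = bq_client.query(query, job_config=job_config)
results.result()          # block until the job is complete
print(results.done())     # True once the job has finished
print(results.cache_hit)  # now a boolean instead of None (True when served from the cache)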
I am receiving a data drop into my GCS bucket daily and have a Cloud Function that loads that CSV data into a BigQuery table (see code below).
import datetime

def load_table_uri_csv(table_id):
    # [START bigquery_load_table_gcs_csv]
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "dataSet.dataTable"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
    # [END bigquery_load_table_gcs_csv]
However, the data comes with a 2-day look-back, resulting in repeated rows in the BigQuery table.
Is there a way to update this Cloud Function to only pull in the most recent date from the CSV once it is dropped off? That way I can avoid duplicate data in the reporting.
Or maybe there's a way to run a scheduled query via BigQuery to resolve this?
For reference, the date column within the CSV comes in a TIMESTAMP schema.
Any and all help is appreciated!
Unfortunately, there seems to be no way to do this directly from Google Cloud Platform; you will need to filter your data somehow before loading it.
You could inspect the CSV in your Cloud Function code (see the sketch below) or through another medium.
It's also possible to submit a feature request for Google to consider this functionality.
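For example, a rough sketch of filtering inside the Cloud Function before the load. This assumes pandas and gcsfs are available in the function, and that the CSV has a timestamp column, here called report_date as a placeholder:
import datetime

import pandas as pd
from google.cloud import bigquery

def load_latest_day_only(table_id="dataSet.dataTable"):
    client = bigquery.Client()
    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    # Read the full CSV (pandas can read gs:// URIs when gcsfs is installed).
    df = pd.read_csv(uri, parse_dates=["report_date"])  # "report_date" is a placeholder column name

    # Keep only rows from the most recent date in the file, dropping the look-back rows.
    latest = df["report_date"].dt.date.max()
    df = df[df["report_date"].dt.date == latest]

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()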
I am trying to run a COPY command that loads around 100 GB of data from S3 to Redshift. I am using a Lambda function to initiate this COPY command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *

def lambda_handler(event, context):
    con = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
    cur = con.cursor()
    try:
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occured " + str(e)
        print(body)
    con.close()
This function runs fine, but the problem is that after the Lambda function's 15-minute timeout, the COPY command also stops executing in Redshift. As a result, I cannot finish my copy load from S3 to Redshift.
I also tried including the statement_timeout setting below, after the BEGIN statement and before the COPY command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
The AWS documentation isn't explicit about what happens when a timeout occurs, but I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket used by the database connection will be closed, and the Redshift process listening on that socket will receive an end-of-file, i.e. a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and roll back their transactions.
The reason I gave that description is to make clear that you can't extend the life of a query beyond the life of the Lambda that initiates it. If you want to stick with a database connection library, you will need to use a service that doesn't time out: AWS Batch and ECS are two options.
But there's a better option: the Redshift Data API, which is supported by boto3.
This API operates asynchronously: you submit a query to Redshift and get back a token that can be used to check the query's status. You can also instruct Redshift to send a message to AWS EventBridge when the query completes or fails (so you can create another Lambda to take appropriate action).
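A rough sketch of that pattern with boto3 (the cluster, database, user, and COPY statement below are placeholders):
import time

import boto3

client = boto3.client("redshift-data")

# Submit the query asynchronously; execute_statement returns immediately with a statement Id.
response = client.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder
    Database="my_db",                 # placeholder
    DbUser="my_user",                 # placeholder
    Sql="COPY my_table FROM 's3://my-bucket/prefix/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRole' FORMAT AS PARQUET;",
)
statement_id = response["Id"]

# Poll for completion from wherever is convenient (another Lambda, Step Functions, etc.).
while True:
    status = client.describe_statement(Id=statement_id)
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        print(status["Status"], status.get("Error", ""))
        break
    time.sleep(10)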
I recommend using the Redshift Data API in Lambda to load data into Redshift from S3.
You can get rid of the psycopg2 package and use the built-in boto3 package in Lambda.
This runs the COPY query asynchronously, so the Lambda function won't take more than a few seconds to run.
I use sentry_sdk to get notifications of runtime errors from Lambda.
import boto3
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaaaa#aaaa.ingest.sentry.io/aaaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0,
)

def execute_redshift_query(sql):
    data_client = boto3.client('redshift-data')
    data_client.execute_statement(
        ClusterIdentifier='redshift-cluster-test',
        Database='db',
        DbUser='db_user',
        Sql=sql,
        StatementName='Test query',
        WithEvent=True,
    )

def handler(event, context):
    query = """
        copy schema.test_table
        from 's3://test-bucket/test.csv'
        IAM_ROLE 'arn:aws:iam::1234567890:role/TestRole'
        region 'us-east-1'
        ignoreheader 1 csv delimiter ','
    """
    execute_redshift_query(query)
    return True
And another Lambda function to send an error notification if the COPY query fails.
You can add an EventBridge trigger for that Lambda with a rule that matches the Redshift Data API statement status-change events; WithEvent=True in the first function makes Redshift publish those events.
Here is the Lambda code to send the error notification.
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaa#aaa.ingest.sentry.io/aaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0,
)

def lambda_handler(event, context):
    try:
        if event["detail"]["state"] != "FINISHED":
            raise ValueError(str(event))
    except Exception as e:
        sentry_sdk.capture_exception(e)
    return True
You can identify which COPY query failed by using the StatementName defined in the first Lambda function.
Hope this helps.
I've recently started getting the following BigQuery error when using the Python API:
google.api_core.exceptions.BadRequest: 400 configuration.query.destinationTable cannot be set for scripts
This is the function I use:
from google.cloud import bigquery

def execute_bigquery_sql(query, dataset_id, table_id, use_legacy_sql=True, write_disposition='WRITE_TRUNCATE'):
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig()
    job_config.use_legacy_sql = use_legacy_sql
    print("table_id: {table_id}".format(table_id=table_id))
    print("dataset_id: {dataset_id}".format(dataset_id=dataset_id))
    if table_id:
        table_ref = client.dataset(dataset_id).table(table_id)
        print("table_ref: {table_ref}".format(table_ref=table_ref))
        job_config.destination = table_ref
        job_config.write_disposition = write_disposition
        job_config.allow_large_results = True
        job_config.create_disposition = "CREATE_IF_NEEDED"  # snake_case attribute on QueryJobConfig
    query_job = client.query(query, job_config=job_config)
    results = query_job.result()  # Waits for job to complete.
Does anyone know what might be happening, and is there a workaround?
Thanks for the responses, the comments were indeed pointing in the right direction. In BigQuery, "scripting" refers to https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting, and that is what is not allowed with a destination table. My query had:
DECLARE capital int64 default 10000000;
So, removing the line above was the fix in my case.
Interestingly, the web interface behaves the same way: if you use script statements, it won't let you save the results to a table, whereas without script statements the destination-table option is available.
The error is self-descriptive: scripts do not allow a destination table to be set. Instead you should use DML/DDL inside the script, as sketched below.
The workaround is to reset job_config with no destination table in it.
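For instance, a minimal sketch of the DDL workaround (the project, dataset, table, and column names are placeholders): drop job_config.destination and let the script write its own output with CREATE OR REPLACE TABLE ... AS:
from google.cloud import bigquery

client = bigquery.Client()

script = """
DECLARE capital INT64 DEFAULT 10000000;

-- The script creates its own destination table, so no destination is set on the job config.
CREATE OR REPLACE TABLE `my_project.my_dataset.my_table` AS
SELECT *
FROM `my_project.my_dataset.source_table`   -- placeholder source
WHERE amount > capital;
"""

client.query(script).result()  # standard SQL (the default); scripting is not available in legacy SQL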
I'm trying to run multiple BigQuery queries via the Python API, but it seems like not all of them are completing; my guess is that I'm not giving them enough time to finish. What I have is something like:
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ..."
client.query(query1)
query2 = "UPDATE ..."
client.query(query2)
query3 = "UPDATE ..."
client.query(query3)
My solution so far is to insert a sleep before each client call; it works, but it is slow. Any hints or tips on how to do this more efficiently are appreciated.
BigQuery uses asynchronous jobs for queries. That means you can submit your query but it hasn’t necessarily finished yet. You can wait for it to finish if the next update requires the previous one to be complete.
If they aren’t tightly coupled, just send all of your updates and then wait at the end and all should complete. The return object of a query is that job, so you can use that to check the status of any of your update commands.
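For example, a minimal sketch of the submit-everything-then-wait approach for independent updates:
from google.cloud import bigquery

client = bigquery.Client()

queries = ["UPDATE ...", "UPDATE ...", "UPDATE ..."]

# client.query() only submits the job and returns a QueryJob immediately.
jobs = [client.query(q) for q in queries]

# result() blocks until each job finishes (and raises if the job failed).
for job in jobs:
    job.result()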
If you just add query_job.result(), it will wait until the query has finished executing.
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ..."
query_job = client.query(query1)
query_job.result()
query2 = "UPDATE ..."
query_job = client.query(query2)
query_job.result()
query3 = "UPDATE ..."
query_job = client.query(query3)
query_job.result()
Personally, I would rather write it a bit more compactly:
from google.cloud import bigquery
client = bigquery.Client()
queries = ["UPDATE ...", "UPDATE ...", "UPDATE ..."]
for query in queries:
    query_job = client.query(query)
    query_job.result()
If you assign the job to a variable, you can use job.state or job.error_result to determine whether the job has finished. Then you can do something like:
import time

j = client.query(query1)
while j.state == 'RUNNING':
    time.sleep(10)
    j.reload()  # refresh the job metadata from the API; j.state does not update by itself
    if j.error_result is not None:
        print(j.error_result)
        break
You can also use ";" to run queries one after another,
e.g. query1 = "select * from table1; select * from table2"
Now that BigQuery Scripting has been released, you can simply run all UPDATEs as a script:
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ...;"   # each statement in the script must end with a semicolon
query1 += "UPDATE ...;"
query1 += "UPDATE ...;"
client.query(query1)
We're using Google BigQuery via the Python API. How would I create a table (a new one, or overwrite an old one) from query results? I reviewed the query documentation, but I didn't find it useful.
We want to simulate:
"SELECT ... INTO ..." from ANSI SQL.
You can do this by specifying a destination table in the query. You would need to use the Jobs.insert API rather than the Jobs.query call, and you should specify writeDisposition=WRITE_APPEND and fill out the destination table.
Here is what the configuration would look like, if you were using the raw API. If you're using Python, the Python client should give accessors to these same fields:
"configuration": {
  "query": {
    "query": "select count(*) from foo.bar",
    "destinationTable": {
      "projectId": "my_project",
      "datasetId": "my_dataset",
      "tableId": "my_table"
    },
    "createDisposition": "CREATE_IF_NEEDED",
    "writeDisposition": "WRITE_APPEND"
  }
}
The accepted answer is correct, but it does not provide Python code to perform the task. Here is an example, refactored out of a small custom client class I just wrote. It does not handle exceptions, and the hard-coded query should be customised to do something more interesting than just SELECT * ...
import time

from google.cloud import bigquery
from google.cloud.bigquery.table import Table
from google.cloud.bigquery.dataset import Dataset

class Client(object):
    def __init__(self, origin_project, origin_dataset, origin_table,
                 destination_dataset, destination_table):
        """
        A Client that performs a hardcoded SELECT and INSERTS the results in a
        user-specified location.

        All init args are strings. Note that the destination project is the
        default project from your Google Cloud configuration.
        """
        self.project = origin_project
        self.dataset = origin_dataset
        self.table = origin_table
        self.dest_dataset = destination_dataset
        self.dest_table_name = destination_table
        self.client = bigquery.Client()

    def run(self):
        query = ("SELECT * FROM `{project}.{dataset}.{table}`;".format(
            project=self.project, dataset=self.dataset, table=self.table))

        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        destination_dataset = self.client.dataset(self.dest_dataset)
        destination_table = destination_dataset.table(self.dest_table_name)
        job_config.destination = destination_table

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = self.client.query(query, job_config=job_config)

        # Wait for the query to finish
        job.result()
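Used, for example, like this (all names are placeholders):
client = Client(
    origin_project="my_project",
    origin_dataset="my_dataset",
    origin_table="my_table",
    destination_dataset="my_dest_dataset",
    destination_table="my_dest_table",
)
client.run()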
Create a table from query results in Google BigQuery. Assuming you are using a Jupyter Notebook with Python 3, I am going to explain the following steps:
How to create a new dataset on BQ (to save the results)
How to run a query and save the results as a table in the new dataset on BQ
Create a new dataset on BQ: my_dataset
from google.cloud import bigquery

bigquery_client = bigquery.Client()  # Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id)  # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.

# Specify the geographic location where the new dataset will reside. Remember this should be
# the same location as the source dataset we are querying.
dataset.location = 'US'

# Send the dataset to the API for creation. Raises google.api_core.exceptions.AlreadyExists
# if the Dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))
Run a query on BQ using Python:
There are 2 variants here:
Allowing Large Results
A query without --allow_large_results
I am using the public dataset bigquery-public-data:hacker_news, table comments, to run the query.
Allowing Large Results
DestinationTableName='table_id1' #Enter new table name you want to give
!bq query --allow_large_results --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments]'
This query will allow large query results if required.
Without mentioning --allow_large_results:
DestinationTableName='table_id2' #Enter new table name you want to give
!bq query --destination_table=project_id:my_dataset.$DestinationTableName 'SELECT * FROM [bigquery-public-data:hacker_news.comments] LIMIT 100'
This will work for queries whose results do not exceed the limit mentioned in the Google BQ documentation.
Output:
A new dataset on BQ with the name my_dataset
Results of the queries saved as tables in my_dataset
Note:
These queries are bq commands that you can also run in a terminal (without the ! at the beginning). Since we are running them from a notebook with Python, we prefix them with !, which lets us run shell commands from within the notebook.