I'm trying to run multiple BigQuery queries via the Python API, but it seems like not all of them complete. My guess is that I'm not giving them enough time to finish. What I have is something like:
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ..."
client.query(query1)
query2 = "UPDATE ..."
client.query(query2)
query3 = "UPDATE ..."
client.query(query3)
My solution so far is inserting a sleep command before each client call. It works, but it is kind of slow. Any hints or tips on how to do this in a more efficient way are appreciated.
BigQuery uses asynchronous jobs for queries. That means you can submit your query but it hasn’t necessarily finished yet. You can wait for it to finish if the next update requires the previous one to be complete.
If they aren't tightly coupled, just send all of your updates and then wait at the end; all of them should complete. The return value of client.query() is that job, so you can use it to check the status of any of your update commands.
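For the loosely coupled case, a minimal sketch of the submit-everything-then-wait pattern (the UPDATE strings are placeholders, as in the question):

from google.cloud import bigquery

client = bigquery.Client()

# Submit all updates without blocking; each call returns a QueryJob.
jobs = [client.query(q) for q in ("UPDATE ...", "UPDATE ...", "UPDATE ...")]

# Wait at the end: result() blocks until each job finishes and raises
# if the job failed.
for job in jobs:
    job.result()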
More info
If you just add query_job.result() after each call, it will wait until that query has finished.
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ..."
query_job = client.query(query1)
query_job.result()
query2 = "UPDATE ..."
query_job = client.query(query2)
query_job.result()
query3 = "UPDATE ..."
query_job = client.query(query3)
query_job.result()
Personally, I would write it a bit more compactly:
from google.cloud import bigquery
client = bigquery.Client()
queries = ["UPDATE ...", "UPDATE ...", "UPDATE ..."]
for query in queries:
    query_job = client.query(query)
    query_job.result()
If you assign the returned job to a variable, you can use job.state or job.error_result to determine whether the job has finished. Then you can do something like:
import time

j = client.query(query1)
while j.state != 'DONE':
    time.sleep(10)
    j.reload()  # refresh the job state from the API
if j.error_result is not None:
    print(j.error_result)
You can also use ";" to run queries one after another, e.g.:
e.g. query1 = "select * from table1; select * from table2"
Now that BigQuery Scripting has been released, you can simply run all UPDATEs as a script:
from google.cloud import bigquery
client = bigquery.Client()
query1 = "UPDATE ..."
query1 += "UPDATE ..."
query1 += "UPDATE ..."
client.query(query1)
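Note that client.query() still returns immediately, even for a script; as in the answers above, call .result() on the returned job if the whole script needs to finish before the program moves on.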
Related
I'm trying to use the python module: redshift_connector to create a materialized view in a postgresql redshift database. My query is akin to:
query = """CREATE MATERIALIZED VIEW some_results AS SELECT * FROM other_results;"""
I pass this into the function below and nothing happens, however if I log into the database via terminal and run the same commands it works. Is this a permissions thing, or is there something wrong with my connector code? I can read from tables in the database with the connector without an issue.
def redshift_query(query):
    conn = redshift_connector.connect(
        host=XXXX,
        port=5439,
        database=XXXX,
        user=XXXX,
        password=XXXX)
    print('Connected to Redshift...')
    with conn.cursor() as cursor:
        print('Querying Database...')
        t1 = time.time()
        cursor.execute(query)
        t2 = time.time()
        print(f'command sent, this took {t2-t1}s')
    pass
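One thing worth checking (an assumption on my part, not something stated in the post): redshift_connector follows the DB-API convention of not autocommitting, so DDL such as CREATE MATERIALIZED VIEW can be discarded when the connection closes unless it is committed. A minimal sketch of the same function with an explicit commit (connection parameters are placeholders, as in the question):

import time
import redshift_connector

def redshift_query(query):
    conn = redshift_connector.connect(
        host=XXXX,
        port=5439,
        database=XXXX,
        user=XXXX,
        password=XXXX)
    try:
        with conn.cursor() as cursor:
            t1 = time.time()
            cursor.execute(query)
        conn.commit()  # persist the DDL; autocommit is off by default
        print(f'command sent, this took {time.time() - t1}s')
    finally:
        conn.close()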
I have written my query in this way. I wanted to pass the project name dynamically; I also defined the project name in JSON files and in a variables .py file.
def bq_count_check(**context):
    client = bigquery.Client()
    # Perform a query.
    query = """SELECT COUNT(*) AS COUNT1 FROM {{params.project_name}}.mydataset.tablename"""
    query_job = client.query(query)  # API request
    result = query_job.result()  # Waits for query to finish
    for row in result:
        # count check with results, return task_id
        if row['COUNT1'] > 0:
            return "do_task"
        else:
            return "SKIP_TASK_AND_FINISH"
I have a personal project to create a Telegram bot using Python. What I want is to reply to any question with a dynamic answer generated from a database query. I don't want to run a query for every request to the bot, so my idea is to generate a set of data (a data frame) that the bot can take the answer from. To generate the data frame, I want to schedule/reload the querying part of the script every x minutes. My goal is a Python script that can reload only the data-querying part without reloading the whole script. Is there any way to do this?
Sample code:
from datetime import datetime
import cx_Oracle
import pandas.io.sql as psql  # assuming psql refers to pandas' SQL helpers

tt = datetime.now()
dsn_tns = cx_Oracle.makedsn(----)
conn = cx_Oracle.connect(user=----, password=----, dsn=dsn_tns)
cursor = conn.cursor()
sql = ("""select *
          from TABLE
          WHERE REPORTDATE > to_date(:tt,'DD-MM-YYYY HH24:MI:SS')""")
param = {"tt": tt}
data = psql.read_sql(sql, conn)  # ,params = param)
conn.close()
x = 2314  # value from question via bot
answer = data[data['number'] == x]
The part I want to reload regularly is from tt until conn.close().
I'm not sure why you don't want to rerun the query for each bot request, this would make more sense. It seems like you could also have missing data if you do not update your data for each bot request.
However, you can just wrap the code between tt and conn.close() in a function which you can set to run periodically.
def update_data():
    global data
    tt = datetime.now()
    dsn_tns = cx_Oracle.makedsn(----)
    conn = cx_Oracle.connect(user=----, password=----, dsn=dsn_tns)
    cursor = conn.cursor()
    sql = ("""select *
              from TABLE
              WHERE REPORTDATE > to_date(:tt,'DD-MM-YYYY HH24:MI:SS')""")
    param = {"tt": tt}
    data = psql.read_sql(sql, conn)  # ,params = param)
    conn.close()
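To actually run update_data() every x minutes without blocking the bot, one simple option (a sketch, independent of any particular bot framework; the interval name is made up) is a threading.Timer that reschedules itself:

import threading

REFRESH_MINUTES = 5  # hypothetical interval; use whatever "x minutes" should be

def refresh_periodically():
    update_data()  # the function defined above
    # Schedule the next refresh; a daemon timer won't block interpreter shutdown.
    timer = threading.Timer(REFRESH_MINUTES * 60, refresh_periodically)
    timer.daemon = True
    timer.start()

refresh_periodically()  # call once at startup; the bot keeps reading `data`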
I've recently been getting the following BigQuery error when using the Python API:
google.api_core.exceptions.BadRequest: 400 configuration.query.destinationTable cannot be set for scripts
This is the function I use:
def execute_bigquery_sql(query, dataset_id, table_id, use_legacy_sql=True, write_disposition='WRITE_TRUNCATE'):
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig()
    job_config.use_legacy_sql = use_legacy_sql
    print("table_id: {table_id}".format(table_id=table_id))
    print("dataset_id: {dataset_id}".format(dataset_id=dataset_id))
    if table_id:
        table_ref = client.dataset(dataset_id).table(table_id)
        print("table_ref: {table_ref}".format(table_ref=table_ref))
        job_config.destination = table_ref
        job_config.write_disposition = write_disposition
        job_config.allow_large_results = True
        job_config.create_disposition = "CREATE_IF_NEEDED"
    query_job = client.query(query, job_config=job_config)
    results = query_job.result()  # Waits for job to complete.
Does anyone know what might be happening, and is there a workaround?
Thanks for the responses; the comments were indeed in the right direction. In BigQuery, scripting means https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting, and that is what is not allowed when a destination table is set. My query had:
DECLARE capital int64 default 10000000;
So, removing the line above was the fix in my case.
Interestingly, the same applies in the web interface: if you use scripts, the UI won't allow saving the results to a table, whereas without script statements the option to save to a table is available.
The error is self-descriptive: scripts do not allow a destination table to be set; instead, you should use DML/DDL.
The workaround is to reset job_config with no destination table in it.
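As a sketch of the DDL route (the dataset, table, and column names below are invented for illustration): drop job_config.destination entirely and let the script write its own output with CREATE OR REPLACE TABLE ... AS SELECT:

from google.cloud import bigquery

client = bigquery.Client()

# No destination table in the job config; the script's own DDL decides
# where the results go.
script = """
DECLARE capital INT64 DEFAULT 10000000;
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * FROM mydataset.source WHERE amount > capital;
"""
client.query(script).result()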
It seems like the caching for the Python bigquery library does not work. The example below always prints None. How can I fix this issue?
from google.cloud import bigquery

bq_client = bigquery.Client()
query = """
SELECT *
FROM (SELECT 1)
"""
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = True
results = bq_client.query(query, job_config=job_config)
print(results.cache_hit)
It turns out you have to do something with the results object first, for example results.to_dataframe().
Thanks to VictorGGI and William Funks: cache_hit will return None if the job is not yet complete. You need to call done() to verify that the job has completed, or use any other method that does so.